%!TEX TS-program = xelatex
%!BIB TS-program = biber
%!TEX encoding = UTF-8 Unicode

\documentclass[12pt]{article}

\usepackage{amsmath,amssymb,amsthm,mathtools}
\usepackage{fontspec}
\setromanfont{Times New Roman}
\setsansfont{Arial}

\usepackage{geometry}
    \geometry{letterpaper}

\usepackage{setspace}
\usepackage{fullpage}
\usepackage{multirow}

\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
    \captionsetup[figure]{labelfont={rm,bf}, textfont={rm,it}}
    \captionsetup[table]{labelfont={rm,bf}, textfont={rm,it}}
\usepackage{float}
\usepackage{enumitem}
\usepackage{authblk}
\usepackage{xcolor}

\usepackage[authordate, backend=biber]{biblatex-chicago}
    \addbibresource{refs.bib}

\usepackage{tikz,pgfplots}
    \usetikzlibrary{calc,patterns,decorations.pathreplacing,trees}
    \usepgfplotslibrary{fillbetween}
    \pgfplotsset{compat=1.15}
    \tikzstyle{every picture}+=[font=\small]
    \tikzset{>=latex}

\usepackage{hyperref}
\usepackage{url}

\makeatletter
\newtheoremstyle{mystyle}
	{1em}
	{0.5em}
	{\singlespacing\addtolength{\@totalleftmargin}{3em}
	\addtolength{\linewidth}{-6em}
	\parshape 1 3em \linewidth}
	{}
	{\bfseries}
	{. }
	{ }
	{}
\makeatother

\usepackage[nameinlink,noabbrev]{cleveref}
	\theoremstyle{mystyle}
 	\newtheorem{lemma}{Lemma}
	\newtheorem{proposition}{Proposition}
	\newtheorem{defn}{Definition}

\usepackage{tocloft}
\usepackage[normalem]{ulem}

%%% Prevent the hyphenation in biblio
\makeatletter
\AtEveryBibitem{\global\undef\bbx@lasthash}
\makeatother

\newcommand{\Ehat}[1]{\widehat{\mathbb{E}}[Y_i|#1,\mathbf{f}]}
\newcommand{\Ehatf}{\widehat{\mathbb{E}}[Y_i|\mathbf{f}]}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\begin{document}

\title{Measuring How Much Judges Matter for Case Outcomes\footnote{This is one of several joint papers by the authors on judicial decision making in the federal courts; the ordering of names reflects a principle of rotation.}}

\author{Ryan Copus\footnote{Ryan Copus is an Associate Professor at the University of Missouri--Kansas City School of Law, 500 E. 52nd Street, Kansas City, MO 64110, USA. Email:~copusr@umkc.edu. ORCID 0000-0001-5242-9480.}~~~~~~~~~~Ryan Hübert\footnote{Ryan Hübert is an Associate Professor in the Department of Methodology at the London School of Economics and Political Science, Houghton Street, London WC2A 2AE, United Kingdom. Email:~r.hubert@lse.ac.uk. ORCID 0000-0003-1556-4127.}}

\date{March 2025}

\maketitle

\vspace{-1em}

\begin{abstract}
\noindent A large empirical literature examines how judges' traits affect how cases get resolved. This literature has led many to conclude that judges matter for case outcomes. But how much do they matter? Existing empirical findings \textit{understate} the true extent of judicial influence over case outcomes since standard estimation techniques hide some disagreement among judges. We devise a machine learning method to reveal additional sources of disagreement. Applying this method to the Ninth Circuit, we estimate that at least 38\% of cases could be decided differently based solely on the panel they were assigned to.
\end{abstract}

\begin{center}

\vspace{1em}

Forthcoming in the \textit{Journal of Law and Courts}

\vspace{1em}

\textit{Supplemental Information and replication files to be made available online.}

\end{center}

\thispagestyle{empty}
\setcounter{page}{0}
\newpage

\part*{}

\doublespace

How much do judges matter for the resolution of legal cases? This question haunts the American legal profession. During his confirmation hearings, Chief Justice John Roberts argued that ``judges wear black robes, because it doesn't matter who they are as individuals. That's not going to shape their decision'' \parencite[][p.~178]{RobertsHearing2005}. But other judges have vociferously rejected this notion, offering a range of rationales for why it is unreasonable to presume (or even aspire to the notion) that judges don't matter for case outcomes. Perhaps most famously, then-Judge Sonia Sotomayor said in a 2001 speech that ``I would hope that a wise Latina woman, with the richness of her experiences, would more often than not reach a better conclusion than a white male who hasn't lived that life'' \parencite[][p.~92]{Sotomayor2002}.

Among those who study the legal system, there is widespread agreement that judges \textit{do} matter for the resolution of cases. But just how much they matter is a source of debate. In this paper, we have a simple goal: to offer a novel way to quantitatively measure the extent to which judges matter for case outcomes. 

We build on an unusually rich set of prior research findings. An enormous empirical literature examines the myriad ways that cases could get resolved differently depending solely on the circumstances under which they are heard. Perhaps most importantly, judicial politics scholars have studied how case outcomes depend on judges' traits---their ideologies, races, genders, etc.\footnote{For our purposes here, we set aside methodological debates about the extent to which these analyses identify \textit{causal} effects, a point that is developed in, for example, \textcite{hubert_copus_jop} and \textcite{chp2024}.} For example, many studies have documented that federal appeals are resolved differently when assigned to majority Republican panels instead of majority Democratic panels \parencite[e.g.,][]{Revesz1997,Sunstein2006,Epstein2013}. Others have documented that federal appeals are resolved differently when assigned to panels containing women or Black judges instead of all-male or all non-Black panels \parencite[e.g.,][]{Farhang2004,Boyd2010,Kastellec2013}.

These studies provide important data points for evaluating the extent to which judges matter. Their focus is on characterizing the extent to which judges that share certain traits systematically disagree with other judges. Systematic disagreement among judges is also our focus,\footnote{We set aside the issue of whether judges are disagreeing with themselves across cases---i.e., intra-judge disagreement---which has been the subject of many prior studies \parencite[e.g.,][]{Kahneman2021, chen2016decision}.} but our starting point is that empirical estimates like these almost always obscure some amount of the \textit{overall} disagreement among judges or panels of judges. 
They are therefore best understood as downward-biased estimates of the overall disagreement between judges. To be clear, this is not intentional, as these estimates typically are not meant to quantify overall disagreement among judges. But if one tries to learn about overall disagreement from these findings, one will generally understate the overall extent to which judges matter for case outcomes.

The core methodological issue is well known: estimates of average treatment effects (ATEs) can mask underlying heterogeneity. In particular, there are two distinct ways that the ``standard'' judicial politics ATEs mask heterogeneity. First, by lumping together a collection of judges by a shared trait (e.g., political ideology, race or gender), estimates will not pick up on important differences among judges who share that trait \parencite[see also][]{Giles2001}. Second, even when a group of judges who share a trait behave similarly to one another, it is possible that, as a group, they respond differently to different kinds of cases. For example, Democratic appointees may be more likely than Republican appointees to reverse a lower court decision favoring the defendant but less likely to reverse a lower court decision favoring the plaintiff. Effects going in different directions cancel out when averaging, making it seem like there are smaller differences among judges than there truly are. 

In this paper, we offer a new way to quantitatively characterize the extent of disagreement between judges that reveals substantially more disagreement than these traditional ATEs in the courts literature. Our core innovation is to recast the methodological problem above as one of developing a new treatment variable that, by construction, minimizes heterogeneity in unit-level effect directions. Then, an average treatment effect estimated using this new treatment variable will (at least in principle) reveal the full extent of disagreement between judges. 

We begin with a simple theoretical model that allows us to derive a \textit{monotonicity-robust treatment} (or MRT) that we formally demonstrate yields an unbiased estimate of disagreement. While this treatment variable is primarily a statistical creation that allows us to more accurately estimate disagreement, it also has a substantive interpretation. For example, if the outcome of interest in a particular setting is whether a case is reversed (as in our empirical application), then a binary version of the MRT indicates whether cases are assigned to the panel more likely to reverse it or to the panel less likely to reverse it. Importantly, this is a unit-by-unit determination. For example, Panel A may be more likely to reverse than Panel B on Case 1, but less likely to reverse on Case 2. In this scenario, Case 1 would be in the MRT ``treatment group'' if it was assigned to Panel A, while Case 2 would be in the MRT ``treatment group'' if it was assigned to Panel B.

The core practical challenge is measuring MRTs accurately using real world data. Any measurement error in an MRT will cause resulting estimates of disagreement to be downward biased since measurement error means the ``hidden'' heterogeneity in the dataset has not been fully eradicated. Since all real-world measures are measured with some error, any estimate of disagreement using our technique will be somewhat downward biased. To mitigate this problem, we develop a machine learning method for measuring MRTs, which is designed to aggressively minimize measurement error and can be applied in a wide variety of contexts. We demonstrate that this method for measuring MRTs is robust and generates quantitative estimates that reveal substantially more disagreement among judges than traditional ATEs. 

We apply our method to an original dataset of civil appeals heard by the Ninth Circuit from 1995 to 2013. We begin by measuring an MRT for the cases in our dataset, which, in our specific application, we term the ``panel reversal quantile'' (or PRQ). We show that the PRQ we measure preserves random assignment and has strong face and construct validity. Since PRQs are meant to measure a latent trait---i.e., the reversal proclivity of a panel---an assessment of the construct validity of our PRQs requires us to demonstrate that our measure does indeed correlate with whether cases are reversed. As we discuss in more detail below, since our PRQs are measured entirely out of sample using a cross-validation approach, it is not a foregone conclusion that they will be correlated with whether cases are reversed. Our measurement strategy might not work, meaning that PRQs could have low construct validity. We show that in our dataset, PRQs are strongly correlated with whether a panel reverses or affirms, and even more strongly correlated with case outcomes than political ideology is.

Using our newly measured PRQs, we then quantitatively characterize disagreement among the panels of judges in the Ninth Circuit by calculating the frequency with which reversals of lower court decisions would have been affirmances had they been assigned to different panels of judges (and vice versa). Since we are seeking to calculate a summary measure of disagreement between panels in a court that has many unique panels that hear cases,\footnote{In our dataset, there are 3,130 unique three judge panels with 371 unique judges.} it is not immediately obvious how to aggregate disagreement between each pair of panels to an overall court-level estimate. We calculate three different summary measures of disagreement in the Ninth Circuit, which we argue are highly informative about how much judges matter for outcomes in the court. 

First, we divide up our dataset into PRQ quintiles so that cases are assigned to one of five treatment arms indicating differing levels of panel reversal proclivity. We show that, as compared to the lowest (least reversal prone) quintile, cases assigned to panels in the third, fourth and fifth quintiles are significantly more likely to be reversed. For example, cases in the fifth PRQ quintile are at least\footnote{As we discuss below, since our PRQs are measured with error, our estimates are always lower bounds, or ``floors,'' on the true estimates.} 16\% more likely to be reversed than cases in the first quintile. Second, we ask: what share of cases \textit{could have} come out differently solely based on panel assignment? In other words, if we switched the case loads of the most reversal-prone and the least reversal-prone panels, how many cases would have come out differently? We estimate that at least 38\% of cases could have come out differently. Third, we ask: if all cases had been randomly reassigned, how many of them would come out differently? We estimate that at least 6.5\% of cases would have come out differently if all the cases in our dataset had been randomly re-assigned. Importantly, these estimates capture disagreement among judges and not other ``non-judicial'' factors. They therefore give us two quantitative measures of the extent to which judges matter for case outcomes in our dataset of Ninth Circuit cases.

In this paper, we take as given that it is important to quantify disagreement between judges because it allows us to empirically understand how much judges matter for case outcomes. Indeed, quantitative estimates like these speak to weighty normative issues relating to the nature of justice in the U.S., as well as policy debates over the functioning of the courts, such as whether the Ninth Circuit is too big \parencite[e.g.,][]{Kozinksi2006}. However, many court scholars and observers want to know more than just the extent to which judges disagree in cases. They often seek to understand \textit{why} judges disagree. We too find it interesting and important to understand the reasons that judges make systematically different decisions. We readily acknowledge that standard judicial politics ATEs have been carefully chosen to shed light on substantively important sources of disagreement, such as political ideology and personal background, even if they do not show the full extent of inter-judge disagreement. This is not our goal here. We are focused on quantifying the extent of disagreement, regardless of its sources.

We contribute most directly to a small number of recent studies attempting to quantify disagreement among decision makers \parencite[e.g.,][]{Fischman2014,Kahneman2016,Kahneman2021}. A core challenge that arises in this prior work is that disagreement is difficult to estimate. \textcite{Fischman2014} is the first to elucidate the averaging problem we describe. Much of our theoretical discussion is similar in spirit (although with some differences), but our core focus is different. While Fischman is primarily concerned with mathematically characterizing upper and lower bounds, we are focused on using novel computational techniques to try to aggressively push up the lower bound to reveal more inter-judge disagreement. Moreover, Fischman's approach to measuring the lower bound on inconsistency introduces finite sample bias, which requires a subsampling correction. Because our measurement technique does not involve taking absolute values, our method avoids introducing finite sample bias in the first place. \textcite{Kahneman2021} urges researchers to run experiments. For example, one could create simulated case materials and ask a set of decision makers to evaluate each one and come to a (hypothetical) decision. While this may have high internal validity (and help get around the averaging problem), it has low external validity to real world data. Our major contribution is to provide a method for mitigating the averaging problem so that researchers can better estimate disagreement between judges sitting in real-world courts.

Until now, judicial politics researchers have formed their impressions about how much judges matter for case outcomes based on disparate empirical estimates that understate the extent of disagreement among judges. By revealing more of the disagreement among judges, we think our method has the potential to allow scholars to peer into the black box of judicial decision making and see what else is there. While in this paper we are primarily focused on explicating the method (and applying it to a dataset of Ninth Circuit cases), in the conclusion we briefly touch on some potentially promising applications.

\section{Quantifying How Much Judges ``Matter''}

We use a simple formal model of appeals to precisely characterize what we mean when we talk about whether judges ``matter'' in our empirical context (the Ninth Circuit). We will not explore intra-panel dynamics in this article, so we treat panels as unitary actors. We will therefore interchangeably refer to ``judges mattering'' and ``panels mattering.'' Exploring intra-panel dynamics in our empirical setting is an interesting avenue for future research, but there are additional methodological challenges that would make it more difficult.\footnote{Our own conversations with officials at the Ninth Circuit, as well as prior academic research \parencite[e.g.,][]{Chilton2015}, suggest that judges may not be randomly allocated to panels. As a result, we cannot be confident about any inferences we draw about individual judges mattering for the outcome of cases.} Our analysis would easily extend to contexts where judges hear cases on their own, such as U.S. District Courts. In the main text below, we provide an abbreviated discussion of the model so that we can get quickly to the main points. In Online Appendix A, we analyze the model in detail.

In the model, cases are defined by sets of ``case features'' (labeled $\mathbf{f}$) as well as idiosyncratic ``fact patterns'' (labeled $x \in \mathbb{R}$). At an intuitive level, case features define clusters of similar cases (e.g., civil rights cases about racial discrimination brought by the EEOC), whereas fact patterns represent the specific facts of a case that signal the strength of each litigant's arguments. More formally, case features define a specific case space \parencite[see][]{Lax2011a} over which there is a distribution of fact patterns. That is, $x$ is distributed according to some conditional distribution with probability density function $f(x|\mathbf{f})$.

Each case $i$ is assigned to a panel $p$, which issues a decision $y_i(p,x_i,\mathbf{f}_i)$ upon seeing $x_i$ and $\mathbf{f}_i$. For ease of exposition, we will just assume $y = 1$ indicates a decision to reverse a lower court decision, and $y = 0$ indicates a decision to affirm a lower court decision. Each panel $p$ has an ideal point for each case space. Formally, we denote this ideal point as $\hat{x}_p(\mathbf{f})$, and assume that on a specific case $i$ drawn from the case space defined by $\mathbf{f}$, a panel $p$ strictly prefers $y_i = 1$ if $x_i \leq \hat{x}_p(\mathbf{f}_i)$ and strictly prefers $y_i = 0$ otherwise. Since each panel has its own ideal point for each case space $\mathbf{f}$, two panels with ideal points $\hat{x}_1(\mathbf{f}_i) < \hat{x}_2(\mathbf{f}_i)$ will disagree about how to resolve a case $i$ whenever $\hat{x}_1(\mathbf{f}_i) < x_i \leq \hat{x}_2(\mathbf{f}_i)$. In this situation, we say that judges ``matter'' for the case's outcome (see Definition 6 in Online Appendix A).
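
To illustrate the decision rule with hypothetical numbers (chosen only for exposition and not drawn from our data), suppose that in some case space $\mathbf{f}$ two panels have ideal points $\hat{x}_1(\mathbf{f}) = 0.3$ and $\hat{x}_2(\mathbf{f}) = 0.7$. A case with fact pattern $x_i = 0.5$ is affirmed by panel 1 (since $0.5 > 0.3$) but reversed by panel 2 (since $0.5 \leq 0.7$), whereas a case with $x_i = 0.2$ is reversed by both panels and a case with $x_i = 0.9$ is affirmed by both. More generally, writing $F(\cdot|\mathbf{f})$ for the cumulative distribution function associated with $f(x|\mathbf{f})$ (notation introduced here only for this illustration), the share of cases in the case space on which the two panels disagree is
\begin{align*}
    \Pr\big(\hat{x}_1(\mathbf{f}) < x_i \leq \hat{x}_2(\mathbf{f}) \,\big|\, \mathbf{f}\big)
    = \int_{\hat{x}_1(\mathbf{f})}^{\hat{x}_2(\mathbf{f})} f(x|\mathbf{f})\,dx
    = F\big(\hat{x}_2(\mathbf{f})|\mathbf{f}\big) - F\big(\hat{x}_1(\mathbf{f})|\mathbf{f}\big).
\end{align*}
If, purely for the sake of the example, fact patterns were uniformly distributed on $[0,1]$, this share would be $0.7 - 0.3 = 0.4$.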

\subsection{Empirical Implications}

Consider a population of cases that are resolved according to the model of judicial decision-making summarized above. From an ex ante perspective, and given the uncertainty in the model, we can think of case outcomes as a random variable, $Y_i(p)$, which depends on the assigned panel. 
In the model, $Y_i(p)$ is well defined for all $p \in \mathcal{P}$, and it is on the equilibrium path if $p$ is actually assigned to case $i$, and off the equilibrium path otherwise. Using the terminology from the standard potential outcomes framework \parencite[see part 1 of][]{ImbensRubin}, for every $p \in \mathcal{P}$, $Y_i(p)$ is a \textit{potential outcome} of case $i$. In all of our analysis below, we will make a stable unit treatment value assumption (SUTVA). This means that we will assume that each case $i$ has a set of exactly $|\mathcal{P}|$ potential outcomes, one for each panel, which are ``stable'' in that they do not depend on how other cases were assigned to panels. We return to this below.

In the population of cases under consideration, how much do judges matter? Since we say that judges matter for outcomes when panels disagree about how a case should be resolved, we need to quantify how many cases feature inter-panel disagreement in order to quantify how much judges matter. It is not obvious how to quantify disagreement among a large set of potential decision-makers. We will work with a foundational definition of disagreement that is dyadic.

\begin{defn}\label{def:disagreement}
    For a population of cases, the \textbf{disagreement} between panels $p_1$ and $p_2$ can be quantified by
    \begin{align*}
        \delta(p_1,p_2) 
        \equiv 
        \mathbb{E}_i\big[|Y_i(p_1) - Y_i(p_2)|\big]
    \end{align*}
\end{defn}

At a theoretical level, this is how we formally quantify how much judges matter. We acknowledge there may be other conceptualizations of judicial disagreement, or of what it means for judges to ``matter,'' but we think ours is reasonable. It amounts to the simple idea that if a case would come out differently if assigned to another panel, then judges mattered for the outcome. To use this definition in the context of a court with more than two decision-makers, one has to decide which dyads of decision-makers to examine when quantifying how much judges matter. We will return to this issue further below, but we will first develop all of our core ideas imagining a setting with just two panels that could hear cases.

Disagreement, as defined above, is a purely theoretical quantity since it is impossible to estimate it due to the fundamental problem of causal inference \parencite[][]{Holland1986}. However, there is another quantity that can, in principle, be estimated and which under certain conditions is equivalent to $\delta(p_1,p_2)$. 

\begin{defn}\label{def:disparity}
    For a population of cases, the \textbf{disparity} between panels $p_1$ and $p_2$ is given by
    \begin{align*}
        \phi(p_1,p_2) 
        \equiv 
        \big| \mathbb{E}[Y_i(p_1) - Y_i(p_2)] \big|
    \end{align*}
\end{defn}

Below, we show that the disparity between two panels can be estimated, but before we do, we must establish when the disagreement between any two panels is equivalent to the disparity between those two panels. This equivalence holds only if the ideal points of the panels retain the same ordering across all cases in the population. We formally define this condition as follows.

\begin{defn}\label{def:monotonicity}
    For a case $i$, let $\mathbf{p}_i$ be a profile of sets of panels ordered in increasing order of ideal points and where each set contains all panels sharing an ideal point.\footnote{For example, with three panels with ideal points on case $i$: $\hat{x}_1 = \hat{x}_3 = 0.3$ and $\hat{x}_2 = 0.7$, then
    \begin{align*}
        \mathbf{p}_i = (\{p|\hat{x}_p = 0.3\}, \{p|\hat{x}_p = 0.7\}) = (\{p_1,p_3\},\{p_2\}).
    \end{align*}
    } A population of cases $\mathcal{M}$ satisfies \textbf{monotonicity} if and only if $\mathbf{p}_i = \mathbf{p}_j$ for all $i,j \in \mathcal{M}$.
\end{defn}

Our first formal result demonstrates that disagreement and disparities are equivalent in populations of cases where monotonicity holds. All proofs of formal results included in the main text are in Online Appendix B.

\begin{lemma}\label{lem:delta-phi-equiv} For a population of cases $\mathcal{M}$ that satisfies monotonicity, $\delta(p_1,p_2) = \phi(p_1,p_2)$.
\end{lemma}

We have claimed that the disparity is estimable, but the definition above is still expressed in terms of counterfactual quantities. It can be estimated with observable quantities as long as the potential outcomes are independent of panel assignment. This is a well known idea from the Neyman-Rubin potential outcomes framework for causal inference. In our substantive context, it would be reasonable to assume independence of the potential outcomes if cases are randomly assigned to panels.

\newcommand{\indep}{\perp \! \! \! \perp}

\begin{defn}\label{def:randomization}
    For a population of cases $\mathcal{R}$, let $A_i$ indicate the panel assigned to case $i \in \mathcal{R}$. Then, $\mathcal{R}$ satisfies \textbf{random assignment} if and only if $Y_i(p)  \indep A_i \text{ for all } p$.
\end{defn}

In most applied empirical settings (including ours), cases can only be considered randomly assigned conditional on some known confounders. For example, cases may be randomly assigned within a courthouse and within a period of time. For exposition, our following results presume unconditional random assignment. However, they can be easily modified to accommodate random assignment conditional on known confounders \parencite[e.g., see p. 54 of][]{Angrist2009}. The next result formally shows that the disparity above can be estimated if random assignment holds.

\begin{lemma}\label{lem:randomization} For a population of cases $\mathcal{R}$ that satisfies random assignment, 
\begin{align*}
    \phi(p_1,p_2) = \big| \mathbb{E}[Y_i(p_1) - Y_i(p_2)] \big| = \big|\mathbb{E}_i[Y_i|p_1] - \mathbb{E}_i[Y_i|p_2]\big| \equiv D(p_1,p_2)
\end{align*}
\end{lemma}
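
The logic behind this result can be sketched in one line. This is only an outline of the standard argument, not a substitute for the proof in Online Appendix B, and it reads $\mathbb{E}_i[Y_i|p]$ as the expected observed outcome among cases assigned to panel $p$. Under SUTVA, the observed outcome of a case assigned to panel $p$ is the potential outcome $Y_i(p)$, so for each $p \in \{p_1,p_2\}$,
\begin{align*}
    \mathbb{E}_i[Y_i|p]
    = \mathbb{E}[Y_i(A_i) \,|\, A_i = p]
    = \mathbb{E}[Y_i(p) \,|\, A_i = p]
    = \mathbb{E}[Y_i(p)],
\end{align*}
where the last equality uses $Y_i(p) \indep A_i$. Taking the difference across the two panels and then the absolute value gives $\phi(p_1,p_2) = D(p_1,p_2)$.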

It is well known that if treatment is randomly assigned, then an unbiased and consistent estimator for the difference in means is the difference in sample means \parencite[see, for example, Theorem 16.3 in][]{Wasserman2004}, which we label $\widehat{D}(p_1,p_2)$. We now have our first major result.

\begin{proposition}\label{prop:conditions}
    In a sample of cases, the sample disparity between panels $p_1$ and $p_2$, $\widehat{D}(p_1,p_2)$, is an unbiased and consistent estimator for disagreement if:
    \begin{itemize}\itemsep-6pt\topsep=-6pt
        \item[(i)] the population of cases satisfies random assignment; and 
        \item[(ii)] the population of cases satisfies monotonicity. 
    \end{itemize}
\end{proposition}

\subsubsection{What if these conditions aren't satisfied?}

Unless both conditions in Proposition 1 are satisfied, the sample disparity will not be an unbiased and consistent estimator for disagreement. We now characterize what happens when each of the two conditions is not satisfied. The most straightforward of these is the first, random assignment, as it is already well known that a difference in means estimator may be biased in the absence of independence of potential outcomes and treatment assignment. Because failure to satisfy random assignment prevents us from making a clear statement about the link between the estimator and the estimand, we use the common practice of referring to it as an ``identification'' problem.

\begin{proposition}[The Identification Problem]\label{prop:identification} For a population of cases, if random assignment is not satisfied, then it is possible that $\phi(p_1,p_2) \neq D(p_1,p_2)$.
\end{proposition}

On the other hand, if monotonicity is not satisfied, then the disparity between panels $p_1$ and $p_2$ will always understate true disagreement, as the following proposition shows.

\begin{proposition}[The Averaging Problem]\label{prop:averaging} For a population of cases, if monotonicity is not satisfied, then $\delta(p_1,p_2) > \phi(p_1,p_2)$.  
\end{proposition}

The conceptual point underlying this proposition is not original to us, and has been discussed elsewhere, and most prominently in \textcite{Fischman2014}. But, the basic idea is intuitive. Since unit-level treatment effects can be positive or negative, if one averages them before taking the absolute value, this will push the magnitude of the resulting estimate toward zero. We term this the ``averaging problem.''
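
A short calculation makes the size of the gap explicit. The notation $\pi_{+}$ and $\pi_{-}$ is introduced here only for illustration, and the sketch assumes binary outcomes as in our setting. Let $\pi_{+} = \Pr\big(Y_i(p_1) - Y_i(p_2) = 1\big)$ be the share of cases that only $p_1$ would reverse, and let $\pi_{-} = \Pr\big(Y_i(p_1) - Y_i(p_2) = -1\big)$ be the share that only $p_2$ would reverse. Then
\begin{align*}
    \delta(p_1,p_2) &= \pi_{+} + \pi_{-}, \\
    \phi(p_1,p_2) &= |\pi_{+} - \pi_{-}|, \\
    \delta(p_1,p_2) - \phi(p_1,p_2) &= 2\min\{\pi_{+},\pi_{-}\}.
\end{align*}
The disparity therefore falls short of disagreement by twice the smaller of the two shares, and the two quantities coincide only when one of $\pi_{+}$ or $\pi_{-}$ is zero.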

To see this more concretely, consider two hypothetical panels depicted in \Cref{fig:disagreement-vignette} who are randomly assigned to hear five cases each. A black circle indicates a panel would reverse the lower court decision and a white circle indicates a panel would affirm. In the observed dataset depicted on the left, Panel 1 reverses in 60\% of cases while Panel 2 reverses in 80\% of cases. Then we would calculate a sample disparity of 20\%.

\begin{figure}[ht]
    \centering
    \caption{A hypothetical example of a court that hears ten cases, randomly split among two panels. The left panel shows an observed dataset and the right panel shows all potential outcomes for all ten cases. A black circle indicates a reversal of the lower court decision and a white circle indicates an affirmance of the lower court decision.}
    \label{fig:disagreement-vignette}
    \begin{tikzpicture}[scale=0.66]

    \begin{scope}
    \filldraw[white] (-1.6,-0.75) rectangle (10.7,3.3);
    
    \node[anchor=south west] at (-1.7,3.75) {\textbf{Observed Dataset}}; 

    \draw (-1.7,3.75) -- (10.5,3.75) -- (10.5,-0.75)  -- (-1.7,-0.75) -- (-1.7,3.75);
    
    \draw[dashed,thick] (-1.6,1.5) -- (10.5,1.5);
    \draw[dashed,thick] (-1.6,0.5) -- (10.5,0.5);
    \draw[dashed,thick] (-1.6,-0.5) -- (10.5,-0.5);
    
    \node[anchor=east,yshift=-1pt] at (0.4,1) {Panel 1};
    \node[anchor=east,yshift=-1pt] at (0.4,0) {Panel 2};
    
    \foreach \x in {1,...,10}
    	\node[anchor=west, xshift=-0pt,yshift=12pt, rotate=90] at (\x,1) {\footnotesize Case \x};
    
    \foreach \x in {1,3,4,5,6,8}
    	\filldraw[black] (\x,0) circle (10pt);
    \foreach \x in {2,7,9,10}
    	\draw[thick,fill=white] (\x,0) circle (10pt);
    
    \foreach \x in {1,3,4,5,6,8}
    	\draw[thick,fill=white] (\x,1) circle (10pt);
    \foreach \x in {2,7,9,10}
    	\filldraw[black] (\x,1) circle (10pt);
    
    
    \foreach \x in {1,3,5,8,10}
    	{\filldraw[white] (\x cm-12pt,1cm-12pt) rectangle (\x cm+12pt,1cm+12pt);}
    \foreach \x in {2,4,6,7,9}
    	{\filldraw[white] (\x cm-12pt,-12pt) rectangle (\x cm+12pt,+12pt);}

    \end{scope}

    \begin{scope}[xshift=12.5cm]

    
    \filldraw[white] (-1.6,-0.75) rectangle (10.7,3.3);

    \node[anchor=south west] at (-1.7,3.75) {\textbf{All Potential Outcomes}}; 
    \draw (-1.7,3.75) -- (10.5,3.75) -- (10.5,-0.75)  -- (-1.7,-0.75) -- (-1.7,3.75);
    
    \draw[dashed,thick] (-1.6,1.5) -- (10.5,1.5);
    \draw[dashed,thick] (-1.6,0.5) -- (10.5,0.5);
    \draw[dashed,thick] (-1.6,-0.5) -- (10.5,-0.5);
    
    \node[anchor=east,yshift=-1pt] at (0.4,1) {Panel 1};
    \node[anchor=east,yshift=-1pt] at (0.4,0) {Panel 2};
    
    \foreach \x in {1,...,10}
    	\node[anchor=west, xshift=-0pt,yshift=12pt, rotate=90] at (\x,1) {\footnotesize Case \x};
        
    \foreach \x in {1,3,4,5,6,8}
    	\filldraw[black] (\x,0) circle (10pt);
    \foreach \x in {2,7,9,10}
    	\draw[thick,fill=white] (\x,0) circle (10pt);
    
    \foreach \x in {1,3,4,5,6,8}
    	\draw[thick,fill=white] (\x,1) circle (10pt);
    \foreach \x in {2,7,9,10}
    	\filldraw[black] (\x,1) circle (10pt);

    \end{scope}
    
    \end{tikzpicture}
\end{figure}

A disparity is informative about disagreement: a high disparity between two panels indicates that disagreement between them is also high. However, the converse is not true, since a low disparity (i.e., close to zero), does \textit{not} indicate a lack of disagreement. For example, consider again \Cref{fig:disagreement-vignette}. On the right, we depict all the potential outcomes for each panel, which demonstrates that these two panels would come to a different decision in every single case, yielding a 100\% disagreement. This is substantially more disagreement than the sample disparity revealed.
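
The numbers in this hypothetical example can be computed directly from \Cref{def:disagreement} and \Cref{def:disparity}. In the full set of potential outcomes, Panel 1 would reverse cases 2, 7, 9 and 10, while Panel 2 would reverse cases 1, 3, 4, 5, 6 and 8. Treating the ten cases as the population,
\begin{align*}
    \delta(p_1,p_2) &= \tfrac{1}{10}\sum_{i=1}^{10} \big|Y_i(p_1) - Y_i(p_2)\big| = \tfrac{10}{10} = 1, \\
    \phi(p_1,p_2) &= \big|\tfrac{4}{10} - \tfrac{6}{10}\big| = 0.2,
\end{align*}
so even the disparity computed from all twenty potential outcomes recovers only one fifth of the disagreement. The sample disparity from the observed dataset, $\widehat{D}(p_1,p_2) = |0.6 - 0.8| = 0.2$, happens to take the same value in this example.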

The underlying problem is that panel 1 is more inclined than panel 2 to reverse some cases, while the converse is true for other cases. For example, panel 1 is more inclined to reverse case 2, but panel 2 is more inclined to reverse case 3. This suggests that the ordering of the two panels' ideal points may differ between case 2 and case 3.\footnote{To see this, recall that in a case space setting, two panels would come to different decisions if and only if a case's fact pattern is between their ideal points.} In other words, in this set of cases, monotonicity is not satisfied since the ordering of the panels' ideal points is not the same across all cases. The averaging problem is downstream of a violation of monotonicity.

\subsection{Solving the Averaging Problem by Measuring a New Treatment Variable}

We propose a solution to the averaging problem that entails measuring a new treatment variable that we call the monotonicity-robust treatment (or MRT). At a theoretical level, the basic idea is that a straightforward transformation of the original treatment variable (i.e., panel assignment) can retain the informational content of that variable while ensuring that observed treatment effects all have the same sign. Specifically, under our MRT, a case $i$ is ``treated'' if it was assigned to the panel with the higher ideal point for that case (and thus the panel more likely to reverse). Formally:

\begin{defn}\label{def:mrt} Let $a_i$ indicate the panel assigned to case $i$. Then, the \textbf{monotonicity-robust treatment (MRT)} is defined by:
\begin{align*}
    m_i(p_1,p_2) = 
    \begin{dcases}
        1 &\text{if }a_i \in \argmax_{p\in\{p_1,p_2\}}\{\hat{x}_p\}\\
        0 &\text{if }a_i \in \argmin_{p\in\{p_1,p_2\}}\{\hat{x}_p\}\\
        \varnothing &\text{otherwise }
    \end{dcases}
\end{align*}
\end{defn}

This definition of the MRT relies on unobservable quantities (the panels' ideal points), but there is an observable quantity that allows us to infer the orderings. In Lemma 4 in Online Appendix B, we show that $\mathbb{E}[Y_i|p_1, \mathbf{f}] < \mathbb{E}[Y_i|p_2, \mathbf{f}]$ if and only if $\hat{x}_1(\mathbf{f}) < \hat{x}_2(\mathbf{f})$. We can re-write the definition of $m_i(p_1,p_2)$ as follows:\footnote{An equivalent, but more notationally cumbersome way to write (\ref{eq:mon-rob-treat}) is: 
\begin{align*}
    m_i(p_1,p_2) = 
    \begin{dcases}
        1 &\text{if }\big[a_i = p_1 \text{ and } \mathbb{E}[Y_i|p_1,\mathbf{f}] > \mathbb{E}[Y_i|p_2,\mathbf{f}]\big]
        \text{ or }\big[a_i = p_2 \text{ and } \mathbb{E}[Y_i|p_2,\mathbf{f}] > \mathbb{E}[Y_i|p_1,\mathbf{f}]\big]\\
        0 &\text{if }\big[a_i = p_1 \text{ and } \mathbb{E}[Y_i|p_1,\mathbf{f}] < \mathbb{E}[Y_i|p_2,\mathbf{f}]\big]
        \text{ or }\big[a_i = p_2 \text{ and } \mathbb{E}[Y_i|p_2,\mathbf{f}] < \mathbb{E}[Y_i|p_1,\mathbf{f}]\big]\\
        \varnothing &\text{otherwise }
    \end{dcases}
\end{align*}}
\begin{equation}
\begin{aligned}\label{eq:mon-rob-treat}
    m_i(p_1,p_2) = 
    \begin{dcases}
        1 &\text{if }a_i \in \argmax_{p\in\{p_1,p_2\}}\{\mathbb{E}[Y_i|p,\mathbf{f}]\}\\
        0 &\text{if }a_i \in \argmin_{p\in\{p_1,p_2\}}\{\mathbb{E}[Y_i|p,\mathbf{f}]\}\\
        \varnothing &\text{otherwise }
    \end{dcases}
\end{aligned}
\end{equation}

Finally, we can define an average treatment effect using this MRT, which we refer to as the monotonicity-robust observable disparity (MROD). We now show that the MROD is, by construction, equivalent to disagreement whenever cases are randomly assigned.

\begin{proposition}\label{prop:trick} Define the \textbf{monotonicity-robust observable disparity (MROD)} as $M(p_1,p_2) \equiv \mathbb{E}\big[Y_i | m_i(p_1,p_2) = 1\big] - \mathbb{E}\big[Y_i | m_i(p_1,p_2) = 0\big]$. Then, for a population of cases satisfying random assignment, $M(p_1,p_2) = \delta(p_1,p_2)$.
\end{proposition}
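
To illustrate, consider a stylized calculation based on \Cref{fig:disagreement-vignette}. The calculation rests on a particular reading of the figure rather than on any additional data: filled circles denote reversals, cases 2, 4, 6, 7, and 9 are assigned to panel 1, cases 1, 3, 5, 8, and 10 are assigned to panel 2, and each panel's potential outcome in a case stands in for its expected outcome in that case. Under that reading, the raw comparison of reversal rates and the MRT-based comparison are:
\begin{align*}
    \big|\overline{Y}_{\text{panel 1}} - \overline{Y}_{\text{panel 2}}\big| &= \big|\tfrac{3}{5} - \tfrac{4}{5}\big| = 0.2\\
    m_i &= 1 \text{ for } i \in \{1,2,3,5,7,8,9\}, \qquad m_i = 0 \text{ for } i \in \{4,6,10\}\\
    \overline{Y}_{m_i = 1} - \overline{Y}_{m_i = 0} &= \tfrac{7}{7} - \tfrac{0}{3} = 1
\end{align*}
Every case with $m_i = 1$ is reversed and every case with $m_i = 0$ is affirmed, so the MRT-based comparison recovers the 100\% disagreement that the raw comparison conceals.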

We denote estimates of the MROD in a sample as $\widehat{M}(p_1,p_2)$. However, to estimate an MROD, we need to know $m_i$ for each case, which itself must be estimated. In principle, for two panels $p_1$ and $p_2$, we can estimate $m_i(p_1,p_2)$ for each case heard by these panels by estimating $\mathbb{E}[Y_i|p,\mathbf{f}]$ and plugging into (\ref{eq:mon-rob-treat}) to yield $\widehat{m}_i(p_1,p_2)$.

Unfortunately, estimates of $\mathbb{E}[Y_i|p,\mathbf{f}]$ from finite samples will be inaccurate. This will generate measurement error: some cases will be incorrectly classified as $\widehat{m}_i(p_1,p_2) = 1$ when in reality $m_i(p_1,p_2) = 0$, and vice versa. This measurement error reintroduces a milder form of the averaging problem,\footnote{Or, perhaps more accurately, measurement error means we do not fully solve the averaging problem.} which we formally show next.

\begin{proposition}[The Floor Problem]\label{prop:floor} If there is measurement error in $\widehat{m}_i(p_1,p_2)$, then $\widehat{M}(p_1,p_2) < M(p_1,p_2)$.
\end{proposition} 

We call this the floor problem because any MROD estimated with noisy measures of the MRT will only give a lower bound, or ``floor,'' on the true level of disagreement. Since a researcher is always working with a finite dataset, nothing can be done about the fact that there will be some error in the estimates $\widehat{m}_i$; the floor problem always exists to some extent. However, the silver lining of the previous result is that we know the direction of the bias in our estimates: they will always understate the true level of disagreement between panels. Moreover, as the proof of the result implies, the bias due to the floor problem will decline as estimates of $m_i(p_1,p_2)$ become more accurate.
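
A stylized special case helps convey the intuition. It is an illustration rather than the general proof, and it assumes that half of the cases have $m_i = 1$, that each case is misclassified with the same probability $\varepsilon < 1/2$, and that misclassification is unrelated to outcomes once we condition on $m_i$. Writing $\mu_1 = \mathbb{E}[Y_i | m_i = 1]$ and $\mu_0 = \mathbb{E}[Y_i | m_i = 0]$ (shorthand introduced only for this example, with the panel arguments of $m_i$ suppressed), the disparity based on the noisy treatment is:
\begin{align*}
    \mathbb{E}[Y_i | \widehat{m}_i = 1] - \mathbb{E}[Y_i | \widehat{m}_i = 0]
    &= \big[(1-\varepsilon)\mu_1 + \varepsilon\mu_0\big] - \big[\varepsilon\mu_1 + (1-\varepsilon)\mu_0\big]\\
    &= (1-2\varepsilon)(\mu_1 - \mu_0) = (1-2\varepsilon)\, M(p_1,p_2)
\end{align*}
With $\varepsilon = 0.1$, for example, the noisy disparity recovers 80\% of the true MROD, and the shortfall vanishes as $\varepsilon \to 0$.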

\subsection{Accommodating More than Two Panels}

Most courts have more than two panels. We can apply all the ideas above to a large court in a flexible manner. Recall that what matters for defining an MRT such as $m_i(p_1,p_2)$ is the \textit{ordering} of the two panels' ideal points, not their magnitudes. So, for each case, if we order \textit{all} panels by their ideal points and determine which quantile the assigned panel falls in, then we can generate a ``composite'' measure of the MRT.


Consider the scenario in \Cref{fig:quantiles}, which depicts the case space for cases that share case features $\mathbf{f}$. There are 10 possible panels that could have been assigned to cases, but only one is actually assigned to each case. Two hypothetical cases are depicted. With respect to the panels' ideal points, the first case was assigned to a panel in the 30th quantile (30\% of panels have lower ideal points) and the second case was assigned to a panel in the 80th quantile (80\% of panels have lower ideal points).


\begin{figure}[ht]
    \centering
    \caption{A hypothetical case space with case features $\mathbf{f}$ and ten panels with differing ideal points. Two cases---Case 1 and Case 2---are assigned to different panels, which are marked in the figure. Case 1 was assigned to a panel at the 30th quantile and Case 2 was assigned to a panel at the 80th quantile.}\label{fig:quantiles}
    \begin{tikzpicture}
        \begin{scope}[yshift=0cm]
            \draw[thick, <->] (0,0) -- (15,0) node[anchor=west] {$\mathbf{f}$};
            \draw[thick, ->] (5,1) node[anchor=south]{Case 1 panel} -- (5,0.2);
            \draw[thick, ->] (12.75,1) node[anchor=south]{Case 2 panel} -- (12.75,0.2);
            \foreach \x in {1,2,2.5,5,6,7,9,10,12.75,13.5}
                \draw (\x,4pt) -- (\x,-4pt) node[anchor=north,gray] {\strut $p$};
        \end{scope}
    \end{tikzpicture}
\end{figure}

If cases are randomly assigned, this means that case 1 was randomly assigned to a ``30th quantile panel'' whereas case 2 was randomly assigned to an ``80th quantile panel.'' Formally, let ${a_i}$ be the panel assigned to case $i$. Then we can define the quantile to which case $i$ was assigned as:
\begin{align}\label{eq:prq}
    q_i = \frac{|\{p \in \mathcal{P} : \hat{x}_p < \hat{x}_{a_i}\}|}{|\mathcal{P}|} = \frac{|\{p \in \mathcal{P} : \mathbb{E}[Y_i|p, \mathbf{f}] < \mathbb{E}[Y_i|a_i, \mathbf{f}]\}|}{|\mathcal{P}|} \in [0,1]
\end{align}
(The latter equality is guaranteed by Lemma 4 from Online Appendix B.) We call $q_i$ case $i$'s ``panel reversal quantile'' (which we abbreviate as PRQ) because it captures the assigned panel's proclivity to reverse. From an ex ante perspective, panels in low PRQs are less likely to reverse a randomly drawn case than panels in higher PRQs.
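
As a worked instance of (\ref{eq:prq}), take the ten panels in \Cref{fig:quantiles}, so that $|\mathcal{P}| = 10$. Three panels lie to the left of the panel assigned to Case 1, and eight panels lie to the left of the panel assigned to Case 2, which gives:
\begin{align*}
    q_1 = \frac{3}{10} = 0.3 \qquad\text{and}\qquad q_2 = \frac{8}{10} = 0.8
\end{align*}
Only the ranks of the ideal points enter this calculation: moving any tick mark in the figure without changing the left-to-right order of the panels would leave both values unchanged.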

The PRQ is a continuous generalization of the binary MRT that we defined above for a specific pair of panels. We can analogously define an MROD for a pair of specific PRQs, such as $Q_1$ and $Q_2$:
\begin{align*}
    M(Q_1,Q_2) = \big|\mathbb{E}[Y_i | q_i = Q_1] - \mathbb{E}[Y_i | q_i = Q_2]\big|
\end{align*}
Since PRQs are a type of MRT, we will refer to them interchangeably in the following sections. 

\subsection{A Side Note on SUTVA}

We assume SUTVA holds in our setting. However, this assumption is likely to be controversial. First, since cases make precedent (and, more generally, judges and litigants learn from the resolution of prior cases), a prior case's panel assignment might indeed influence future cases' outcomes. Second, each treatment (i.e., panel) is very likely to have different ``versions'' of itself that might amount to entirely different treatments.

In our empirical application below, we take an important step to try to mitigate the threat of SUTVA violations. Specifically, all of our effects are estimated within each year. In addition to ensuring that we satisfy the random assignment assumption, this also reduces the impact that learning from prior cases has on the resolution of future cases. For example, it is much less reasonable to assume SUTVA when comparing cases decided in 1995 to cases decided in 2013 than it is when comparing cases decided in 1995 to other cases decided in 1995.

Stepping back, however, SUTVA is implicit in all studies of judicial decision making that make a claim to unbiased effects, so this issue is not unique to our analysis. Exactly how and why SUTVA affects average treatment effects in judicial politics research is an interesting issue for future research.

\section{Measuring MRTs in the Ninth Circuit}

In this section, we measure MRTs in an original dataset of all civil appeals from district courts that were filed between 1995 and 2013 in the Ninth Circuit and that were randomly assigned to three-judge panels. To do this, we need to (1) justify the assumption that panels are randomly assigned to cases (see Proposition 2), (2) ensure that our MRTs preserve randomization (i.e., are not correlated with pre-treatment characteristics), and (3) measure MRTs with as little measurement error as possible (see Proposition 5).

In order to measure MRTs and ensure that they preserve randomization, we draw from the growing literature on estimating heterogeneous treatment effects with meta-learners. As explained below, we use a modified version of the S-Learner and use cross-fitting to protect the assumption that the MRTs are as-if randomly assigned to cases.
While we develop our approach using data from the Ninth Circuit, this approach could be easily adapted for other judicial decision-making contexts. However, we cannot say anything general about how well it will work in all settings. While better algorithms, better data, and more data can reduce measurement error in MRTs, there is little that can be said about which exact combinations of algorithms, predictors, and data will sufficiently reduce measurement error. As a result, a key component of our approach is validation. Below, we validate our Ninth Circuit measures in two ways. First, we provide support for the assumption that the MRTs are not correlated with pre-treatment characteristics. Our measurement strategy thus preserves random assignment. Then, we demonstrate the strong face and construct validity of our measured MRTs. In particular, we demonstrate that the MRTs are indeed strongly predictive of case outcomes, which is what they are designed for.

Our dataset consists of 11,359 appeals, and the outcome variable in our analyses is whether a case is affirmed or not. When a case is not affirmed, we generally refer to the outcome as a ``reversal,'' although that includes decisions to vacate or remand, decisions to reverse or vacate in part, and, on rare occasion (approximately 1\%), decisions labeled as ``Other.''\footnote{Though it may be tempting to drop cases with an ``Other'' outcome, doing so risks introducing post-treatment bias into any causal estimates \parencite[for a detailed discussion of this issue, see][]{hubert_copus_jop}.} Our dataset has 3,130 unique three-judge panels that are composed of 371 unique judges.

\subsection{Verifying Random Assignment of Panels to Cases}

The Ninth Circuit reports that it randomly assigns panels of three judges to most of its cases. As we verified in a conversation with the Clerk of Court, some cases are pre-screened and assigned to panels non-randomly. We drop these cases from our analysis. Of course, case assignment is only random within a time period and region. Thus, when using MRTs to estimate treatment effects, we include region-year fixed effects.

We test the assumption that cases are randomly assigned to panels by testing if case characteristics can predict whether a panel is majority Republican. If they can, then this would indicate that certain types of cases are more likely to be assigned to certain types of judges, violating random assignment. Fortunately, in our dataset, we cannot predict judge partisanship with case characteristics. More specifically, \Cref{fig:partisan-randomization} shows that a stacking ensemble with access to case predictors is no better able to predict whether a case is assigned to a majority Republican panel than is a stacking ensemble with only access to region-year fixed effects. The results provide evidence that panels are randomly assigned to cases.

\begin{figure}[ht]
    \centering
    \caption{We plot two ROC curves for machine learning models that attempt to predict whether a panel is majority Republican (i.e., judge characteristics). One model has access only to region-year fixed effects (the red line), while the other model has access to these fixed effects plus all other case predictors in our dataset (blue line). These additional case predictors do not provide any additional predictive power, indicating that they are not associated with judge characteristics.}\label{fig:partisan-randomization}
    \includegraphics{../Outputs/fg3.pdf}
\end{figure}

\subsection{Measuring MRTs with a Modified S-Learner}\label{subsec:modified-s-learner}

Obtaining the most accurate measures of the relevant MRT (i.e., the PRQ $\hat{q}_i$) requires obtaining the most accurate ordering of $\Ehat{p}$ for all combinations of panels and case characteristics. For modeling complex interactions in a conditional expectation, it is now common to use machine learning, specifically ensemble learning. We use Automatic Machine Learning (AutoML) within H2O, an open-source machine learning platform with an R interface. The stacking methodology employs supervised learning based on loss functions, leveraging $k$-fold cross-validation to determine the optimal combination of diverse base algorithms. The process begins by generating cross-validated predictions for each base learning algorithm in the ensemble, which may include generalized linear models, random forests, and neural networks. This is accomplished by dividing the dataset into $k$ folds, where training occurs on $k-1$ folds while generating predictions on the remaining fold. This procedure repeats $k$ times, ensuring each fold serves as validation data exactly once. Subsequently, the system regresses these predictions against actual outcomes to determine appropriate weights for each base algorithm. The resulting weighted combination yields an ensemble prediction function that is then applied across the entire dataset. The approach is asymptotically optimal for learning outcomes \parencite{polley2010super}. Details regarding our ensemble learner are available in Online Appendix C.2.
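
The stacking step can be summarized in one generic formula. This is a textbook-style sketch, not a description of H2O's exact implementation, and all notation in it is introduced only for illustration: $\hat{g}_1, \dots, \hat{g}_J$ are the base learners, $\hat{g}_j^{(-v)}$ is base learner $j$ trained without fold $v$, $\mathcal{I}_v$ indexes the observations in fold $v$, $\mathbf{z}_i$ collects the predictors for case $i$, and $L$ is the loss function. The ensemble weights and the resulting prediction function are:
\begin{align*}
    \hat{\boldsymbol{\alpha}} &\in \argmin_{\boldsymbol{\alpha}} \sum_{v=1}^{k} \sum_{i \in \mathcal{I}_v} L\Big(Y_i, \sum_{j=1}^{J} \alpha_j\, \hat{g}_j^{(-v)}(\mathbf{z}_i)\Big)\\
    \hat{g}(\mathbf{z}) &= \sum_{j=1}^{J} \hat{\alpha}_j\, \hat{g}_j(\mathbf{z})
\end{align*}
Implementations differ in the constraints they place on $\boldsymbol{\alpha}$ (for example, non-negativity) and in the model used to fit the weights, so the formula should be read as a schematic only.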

But there is still the question of how to employ ensemble learning to achieve the most accurate ordering of $\Ehat{p}$. For this, we take guidance from the literature on meta-learners, techniques to employ machine learning to optimally estimate heterogeneous treatment effects. This literature has almost entirely focused on contexts with binary treatments \parencite{Goplerud}. Even though we have many more treatments (i.e., each unique panel), the research on meta-learners is instructive. 

One common meta-learner is the S-Learner, or the ``Single Learner.'' With a standard S-Learner, we would fit a single model of $\Ehat{p}$. Then, using that model, we would generate predictions of $\Ehat{p}$ for each panel in each case. For a case $i$, the quantile of the assigned panel's predicted outcome in a distribution of counterfactual panels' predicted outcomes would be that case's estimated PRQ, $\hat{q}_i$.

A well-understood problem with this approach is that, by maximizing accuracy of $\Ehat{p}$, the S-Learner may poorly estimate the \textit{ranking} of $\Ehat{p}$ among panels \parencite{salditt2024tutorial}. This is because the learner may place excessive weight on predictors other than the assigned panel, especially if those predictors are highly predictive of the outcome. (In fact, a standard S-Learner might entirely exclude panel variables from its fitted model!) Since case characteristics are highly predictive of outcomes, and data from each unique panel is sparse, it is likely that a standard S-Learner would place excessive weight on case characteristics and fail to accurately discern the ranking among panels. 

One way to ``force'' a learner to place sufficient weight on panel variables in its fitted model would be to estimate a separate model for each panel, e.g., separately fit $\Ehat{p_1}$, $\Ehat{p_2}$, and so on. In contexts where researchers are dealing with binary treatment variables, this kind of meta-learner is referred to as a T-Learner (short for ``Two Learner''). Unfortunately, a T-Learner is not an option in our context. As indicated by its name, it is designed for a binary treatment, but we have many more treatments, since each unique panel is a treatment. And because each unique panel decided few cases together, it is not feasible to fit separate, high-quality models for each unique panel.\footnote{While there are other meta-learners (see, e.g., X-Learner), they also have not been adapted to contexts with a large number of treatments.}

So, we modify the S-Learner to increase algorithmic emphasis on ranking $\Ehat{p}$ for different panels. Recall that the core problem with the standard S-Learner is that the variable of core interest, the panel, may be excluded or given little weight. We make three modifications to address this problem.

\paragraph{Add panel characteristics.} As discussed above, because data from each unique panel is sparse, an S-Learner is unlikely to place sufficient weight on those variables when estimating $\Ehat{p}$. To improve the likelihood that the learner will use panel characteristics to fit its models, we add a collection of panel characteristic variables, such as how many Republican appointees are on the panel; the median, average, maximum, and minimum DIME score of the panel's judges \parencite{bonica_2017}; and dummy variables indicating whether each judge was on the panel. The full list of panel characteristics included in our learner is available in Table C.2 in Online Appendix C.1. By including panel characteristics in the model, we make it easier for the algorithm to predict how different panels would decide different cases. Formally speaking, this transforms the target of estimation from $\Ehat{p}$ to $\Ehat{p,\mathbf{c}}$, where $\mathbf{c}$ is a vector of panel characteristics.

\paragraph{Residualize outcome variable.} S-Learners are prone to put too much weight on highly predictive variables, such as case characteristics. To counteract that, we first ``residualize'' the outcome variable to remove information about case characteristics, and then change the target outcome of the S-Learner to this residualized outcome in order to better focus the learner on panel variables. The residualization process proceeds as follows. We first estimate $\mathbb{E}[Y_i|\mathbf{f}]$ with ensemble learning to best capture the variation in the outcome explained solely by case characteristics. We then isolate the residual variation in the outcome by subtracting those estimates from the outcome. We use those residuals as the target outcome for our S-Learner. More formally, our S-Learner's target of estimation is $\widehat{\mathbb{E}}[\widetilde{Y}_i|p, \mathbf{c}, \mathbf{f}]$, where $\widetilde{Y}_i = Y_i - \widehat{\mathbb{E}}[Y_i|\mathbf{f}]$. Note that we keep case features in our S-Learner despite residualizing because it is possible (indeed likely) that interactions between panel and case characteristics are predictive.

\paragraph{Screen for predictive case characteristics.} Our third, final, and most minor modification is to screen case characteristics for promising interactions with panel characteristics. That is, before estimating $\widehat{\mathbb{E}}[\widetilde{Y}_i|p, \mathbf{c}, \mathbf{f}]$, we select a subset of the case characteristics that have strong interactions with panel variables. This again modifies the S-Learner's target of estimation to $\widehat{\mathbb{E}}[\widetilde{Y}_i|p, \mathbf{c}, \tilde{\mathbf{f}}]$, where $\tilde{\mathbf{f}}$ is a smaller, pre-screened collection of case features.

The screening function we use to determine $\tilde{\mathbf{f}}$ is as follows:
\begin{enumerate}
    \item Run a LASSO regression on the panel predictors and select all panel predictors with a scaled importance greater than 0.8.
    \item Run a LASSO regression that interacts the selected panel predictors with all case predictors and select the case predictors with a scaled importance greater than 0.8.
    \item Include all panel predictors and only the case predictors selected in Step 2 in the ensemble, estimating $\widehat{\mathbb{E}}[\widetilde{Y}_i|p, \mathbf{c}, \tilde{\mathbf{f}}]$. 
\end{enumerate}

The cutoffs we used for scaled importance were selected via testing the performance on data that is not used in the analysis. This third modification does not substantially alter our estimates of $\hat{q}_i$. It is simply one last nudge for the S-Learner to focus on panel variables and their interactions with case features.

\subsection{Cross-Fitting to Preserve the Assumption that Panels are Randomly Assigned to Cases}

Above, we provided evidence that panels are randomly assigned to cases. For valid causal inference, it is critical that our new treatment variable, $\hat{q}_i$, preserves that randomization such that the newly constructed treatment variable is not associated with pretreatment characteristics.

Machine learning models that predict outcomes (like ours) can introduce bias when they are used to construct variables for downstream causal inference analyses. The core problem is that outcomes in a training set may be correlated with predictors in that training set, even though there is no true correlation. For example, a correlation in a training set could simply be spurious, which is possible due to random chance alone. This is a classic example of over-fitting. In our context, this would mean that estimated PRQs could be correlated with pre-treatment case characteristics even though actual panel assignments are random (as we showed above).

To deal with this problem, we draw on recent methodological work showing that cross-fitting can help preserve causal identification when using machine learning methods to estimate heterogeneous treatment effects \parencite{chernozhukov2018double}. In our context, cross-fitting helps preserve the random assignment assumption by ensuring that the predictions used to construct the PRQs are generated from models trained on different data than the data for which we are making predictions. Specifically, we:

\begin{enumerate}
\item Randomly partition our dataset into $K$ folds.
\item For each fold $k = 1, \dots, K$:
\begin{itemize}
\item train our modified S-Learner on all folds \textit{except} fold $k$; and
\item use that model to generate predictions and construct PRQs only for observations in fold $k$.
\end{itemize}
\item Combine the PRQs from all $K$ folds to form our final dataset.
\end{enumerate}

This procedure ensures that the data used to construct each case's PRQ was not used to train the model that yielded that PRQ. As a result, any spurious correlations between our machine learning predictions and case characteristics that might arise during training cannot affect our PRQ measures. We use $K=10$ folds for our cross-fitting procedure, a standard choice in machine learning.
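
In symbols, the cross-fitted PRQ can be sketched as follows. The notation is introduced only for this illustration: $\kappa(i) \in \{1,\dots,K\}$ is the fold containing case $i$, $\hat{g}^{(-k)}$ is the modified S-Learner trained on all folds except fold $k$, $\mathbf{c}_p$ is the vector of characteristics of panel $p$, and $\tilde{\mathbf{f}}_i$ is the vector of screened features of case $i$. The estimated PRQ is then the sample analogue of (\ref{eq:prq}) computed from out-of-fold predictions:
\begin{align*}
    \hat{q}_i = \frac{\big|\big\{p \in \mathcal{P} : \hat{g}^{(-\kappa(i))}(p, \mathbf{c}_p, \tilde{\mathbf{f}}_i) < \hat{g}^{(-\kappa(i))}(a_i, \mathbf{c}_{a_i}, \tilde{\mathbf{f}}_i)\big\}\big|}{|\mathcal{P}|}
\end{align*}
The outcome $Y_i$ never enters the model that ranks the panels for case $i$, which is what prevents over-fitting to case $i$ from leaking into its own PRQ.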

Importantly, we apply the cross-fitting procedure to both stages of our modified S-Learner: both when generating the initial panel-free predictions used for residualization (i.e., estimating $\widehat{\mathbb{E}}[Y_i|\mathbf{f}]$), and when generating the panel-specific predictions used to construct the PRQs (i.e., estimating $\widehat{\mathbb{E}}[\widetilde{Y}_i|p, \mathbf{c}, \tilde{\mathbf{f}}]$). This comprehensive approach helps ensure that we maintain the random assignment assumption throughout our entire estimation procedure. 

\subsection{Evidence that Random Assignment is Preserved}

Although cross-fitting should suffice to preserve randomization, we conduct an additional test for verification. We examine whether cases that are more likely to be reversed are disproportionately assigned to panels with higher PRQs.

To implement this test, we employ the predictions from an ensemble learner trained on only case features that we obtained to residualize outcomes in Section~\ref{subsec:modified-s-learner}. These predictions represent the baseline probability that each case would be reversed, independent of which panel hears it. If our measured PRQs preserve random assignment, they should not be systematically related to these ``reversibility'' predictions.

We test for potential non-random assignment by regressing PRQs on the panel-independent predictions of reversal, including region-time fixed effects. The results support the assumption that random assignment is preserved: we estimate a statistically insignificant coefficient of $-0.07$ ($p$-value: 0.11). Thus, we do not find evidence that more reversible cases are being disproportionately assigned to panels with higher PRQs. This finding, combined with our use of cross-fitting and our earlier evidence that panels are randomly assigned to cases, provides strong support for the validity of our measurement strategy.
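
One way to write this test down, as a sketch that assumes a simple linear specification, is the following. The symbols $\alpha_{r(i),t(i)}$ (a fixed effect for the region and time period of case $i$), $\beta$, and $\varepsilon_i$ are notation introduced for this illustration:
\begin{align*}
    \hat{q}_i = \alpha_{r(i),t(i)} + \beta\, \Ehatf + \varepsilon_i
\end{align*}
Under preserved random assignment, the null hypothesis is $\beta = 0$. The reported estimate corresponds to $\hat{\beta} = -0.07$ with a $p$-value of 0.11, so the null is not rejected at conventional levels.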

\subsection{Face Validity of PRQs}

In \Cref{fig:judge-prq-distributions}, we provide some substantive texture for our PRQ variable, which demonstrates that it captures patterns of decision making that Ninth Circuit observers would find intuitive. First, note that the cases heard by panels including Judge Reinhardt tend to be clustered at the high end of the PRQ distribution. This indicates that these panels are unusually likely to reverse the cases that they are assigned. It is thus noteworthy that Judge Reinhardt's decision making earned him a number of nicknames, including ``Bad Boy of the Federal Judiciary.'' On the other hand, consider the cases heard by panels including Judge Kozinski. While these are more concentrated at lower percentiles, they are more spread out. This suggests that Judge Kozinski's presence on these panels was more moderating than Judge Reinhardt's, which is perhaps unsurprising given that he was Chief Judge during much of the period we study and was thus likely to have been especially concerned with the overall operation of the court and the collegiality between judges.

\begin{figure}[ht]
    \centering
    \caption{We plot the distribution of PRQs for cases assigned to panels containing Judge Reinhardt and the distribution of PRQs for cases assigned to panels containing Judge Kozinski. The former indicates that Judge Reinhardt was unusually influential, since his panels were much more inclined than the court norm to reverse the cases assigned to them. The latter indicates that Judge Kozinski was more conciliatory, since cases assigned to his panels were distributed fairly uniformly across PRQs. This is an indication that he ``went along'' with the other judges on his panel.}
    \label{fig:judge-prq-distributions}
    \includegraphics{../Outputs/fg4.pdf}
\end{figure} 

\subsection{Construct Validity of PRQs}

We have argued---and shown formally---that PRQs capture the extent to which an assigned panel is inclined to reverse or affirm. Before we proceed to our substantive analysis in which we empirically quantify disagreement among panels in the Ninth Circuit, we demonstrate that our measured PRQs have strong construct validity. 

In particular, PRQs will have strong construct validity if they strongly predict whether a case is more or less likely to be reversed. To demonstrate construct validity, we bin cases into PRQ deciles and calculate the mean reversal rate in each decile. We plot this in blue in \Cref{fig:construct-v}, which shows that PRQs are strongly correlated with reversal rates. The correlation between PRQ decile and reversal rates is 0.92.

We further demonstrate the strength of our measure by comparing it to a different treatment variable, political ideology, that has been shown to explain substantial disagreement between panels \parencite[see ch.~3 of][]{Friedman2020}. In \Cref{fig:construct-v}, we show the correlation between DIME scores and reversal rates in red.\footnote{In the plot, we order DIME scores in reverse order so that higher percentiles are lower DIME scores. We do this so that the correlation between DIME and reversals is the same sign as the correlation between PRQs and reversals. This makes it easier to see the difference in correlations.} The correlation is substantially weaker (0.64), indicating that political ideology (at least as measured by DIME scores) does not explain as much disagreement between panels as we have been able to explain with PRQs.

\begin{figure}[ht]
    \centering
    \caption{We show the correlation between PRQs and reversal rates (in blue), and between DIME scores and reversal rates (in red). For the latter, we use the median DIME score of each assigned panel, which we then normalize into percentiles for ease of comparison.}
    \label{fig:construct-v}
    \includegraphics{../Outputs/fg5.pdf}
\end{figure}

\section{How Much Do Judges Matter in the Ninth Circuit?}

Now that we have measured a new MRT, we can perform several analyses to characterize how much judges matter in the Ninth Circuit. In real-world datasets like ours, the number of cases heard by panels at each PRQ will be fairly small. So, if we try to estimate MRODs with specific PRQs, our MROD estimates will be very imprecise. To illustrate, suppose we wanted to estimate an MROD to quantify disagreement between the panels exactly at the 10th PRQ and exactly at the 90th PRQ. In our dataset, there are 2 cases at the 10th percentile and 1 case at the 90th percentile. Obviously, estimating this MROD is not feasible.

The simplest way to deal with this is to ``bin'' PRQs into (approximately) equal-sized intervals. For example, if we bin into five groups, then all cases whose assigned panel has a PRQ less than or equal to 0.20 will be ``treated'' to the first quintile. The downside of doing this is that we are consolidating potentially very different panels into single treatment groups. This will mechanically tend to yield lower estimates of disagreement in exchange for more precise ones.\footnote{To take a simple example, an MROD comparing the lowest percentile to the highest percentile will, in theory, yield a larger but much noisier estimate of disagreement than an MROD comparing the lowest quartile to the highest quartile. The former comparison compares more extreme outlier panels than the latter comparison, but there are many fewer of them.}

\subsection{Comparing Quintiles}

We begin by binning the PRQs into five equal bins, or PRQ quintiles. In \Cref{fig:disagreement-quintiles} we show the estimated effect of assigning cases to different PRQ quintiles relative to the lowest PRQ quintile. For example, assignment to a panel in the highest PRQ quintile rather than the lowest PRQ quintile results in an approximately 16 percentage point increase in the reversal rate. 

\begin{figure}[ht]
    \centering
    \caption{The effect of assigning cases to panels predicted to be more likely to reverse. The reference group is cases assigned to panels in the lowest PRQ quintile. Error bars reflect 95\% confidence intervals. Point estimates and standard errors (in parentheses) are also included above each confidence interval.}
    \label{fig:disagreement-quintiles}
    \includegraphics{../Outputs/fg6.pdf}
\end{figure}

\subsection{Comparing Extremes of the PRQ Distribution}

We now estimate several MRODs that allow us to quantify the extent to which judges \textit{could} matter by comparing outcomes in cases heard by the most extreme outlier panels in the PRQ distribution. Of course, the practical difficulty is again in choosing which outlier panels to compare. If we go too far into the extremes of the distribution, our estimates will be quite noisy; if we do not go far enough, we will uncover less hidden disagreement. We thus test a number of different options. \Cref{fig:disagreement-maximum} displays the estimated effect of assigning a case to the top X\% of PRQs relative to the lowest X\% of PRQs, where X can be 10, 5, 4, 3, 2, or 1.

\begin{figure}[ht]
    \centering
    \caption{Estimates of the extent to which judges matter for case outcomes in the Ninth Circuit. The leftmost estimate is the estimated effect on the likelihood of reversal from re-assigning cases that were assigned to the 10\% of panels with the lowest predicted probability of reversing them to the 10\% of panels with highest predicted probability of reversing them. Each subsequent comparison is of the same form (e.g., lowest 5\% versus highest 5\%). Error bars reflect 95\% confidence intervals. Point estimates and standard errors (in parentheses) are also included above each confidence interval.}
    \label{fig:disagreement-maximum}
    \includegraphics{../Outputs/fg7.pdf}
\end{figure}

Substantively, each of these estimates gives the percentage of appeals that would be decided differently if each case were assigned to a panel most likely to reverse it versus a panel least likely to reverse it. Looking at the right-most estimate, we estimate that at least 38\% of cases would be decided differently if they were assigned to a panel in the top 1\% of panels most likely to reverse instead of to a panel in the bottom 1\% (that is, the 1\% of panels least likely to reverse), and vice versa.

\subsection{How Much Would Re-Randomization Change Outcomes?}

Another potentially interesting quantity for evaluating how much judges matter for case outcomes is the percentage of cases that would be decided differently had cases been re-randomized. In other words, how many cases' outcomes were due solely to the random allocation of their assigned panel?

Formally, this amounts to calculating the average of a large set of (pairwise) MRODs. For example, if we bin PRQs into quartiles, then the average MROD simply averages the pairwise MRODs across all combinations of the four quartiles. \Cref{fig:disagreement-quantile-partitions} plots average MRODs using increasingly fine binning of the PRQs. As the bins become more numerous, our estimates increase because we are uncovering more hidden disagreement among judges. The estimates eventually level off once increasing the number of partitions no longer helps us uncover additional hidden disagreement. The resulting ``asymptote'' is our best estimate of how many cases would be decided differently if they were randomly re-assigned. In this case, we estimate that at least 6.5\% of cases would be decided differently if all the cases in our dataset were randomly re-assigned.
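
As a sketch of the averaging step, in notation that we introduce only for illustration, suppose the PRQs are binned into $K$ equal-sized bins and let $\widehat{D}_{jk}$ denote the estimated MROD between bins $j$ and $k$, with $\widehat{D}_{jj} = 0$. If we assume that a randomly re-assigned case is equally likely to move from any bin to any bin, one natural version of the average is
\begin{align*}
    \bar{D}_K = \frac{1}{K^2}\sum_{j=1}^{K}\sum_{k=1}^{K} \widehat{D}_{jk}
    = \frac{2}{K^2}\sum_{j < k} \widehat{D}_{jk}.
\end{align*}
With quartiles ($K = 4$), this averages over $16$ ordered pairs of bins: the $4$ same-bin pairs contribute zero, and each of the $6$ distinct pairs of quartiles is counted twice. The equal weighting across ordered pairs is an assumption of this sketch and may differ from the exact weights used to produce our estimates.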

\begin{figure}[ht]
    \centering
    \caption{Estimates of how many cases would have a different outcome if they were randomly re-assigned to panels. Estimates are the average of the pairwise estimated effects of assigning cases from a lower quantile partition to a higher quantile partition, including an effect of zero for each partition (allowing for cases to be re-assigned to a panel in the same quantile range).}
    \label{fig:disagreement-quantile-partitions}
    \includegraphics{../Outputs/fg8.pdf}
\end{figure} 

\section{Conclusion}

Quantifying how much judges matter for case outcomes is critical to evaluating the American courts. If there are stark differences in the way judges resolve cases, this casts doubt on the notion that judges are simply ``neutral arbiters'' and raises questions about whether judge-made law can ever be truly consistent.

Yet, decades after the quantitative revolution in judicial politics research, there are still serious challenges to identifying the full extent of disagreement among judges. Traditional average treatment effects paint an incomplete and piecemeal picture of the total amount of disagreement among judges. We demonstrate how advances in machine learning can be leveraged to create a treatment variable that is optimized for quantitatively exposing disagreement between decision makers. With the introduction of our monotonicity robust treatment variable, the PRQ, we hope to encourage the development of a more robust and wide-reaching quantitative literature evaluating the breadth of judicial influence over cases.

There are many ways that high quality estimates of disagreement can aid substantive scholarly research on courts. Perhaps most obviously, advances in estimating disagreement among judges could help resolve the debate over whether the Ninth Circuit's exceptional size has resulted in heightened levels of decision making inconsistency. Our method might also help the very research it has taken inspiration from: the judicial politics literature that focuses on how politics, race, and gender influence decision making. Scholars in that field might use our aggressive method for uncovering disagreement to evaluate the plausibility of theoretical explanations. For example, if inter-judge disagreement is much higher than an average treatment effect motivated by a theoretical explanation, this provides information about the relative importance of the theoretical explanation (similar to the way R-squared is sometimes interpreted). We could also imagine court scholars using our method to identify outlier decisions so as to explore strategic judicial behavior with those decisions---e.g., do judges tend to leave outlier decisions unpublished so as to avoid drawing attention from their colleagues? We think the possibilities are plentiful, and we encourage researchers to take the study of judicial disagreement seriously. 

\printbibliography

\newpage
\appendix

\begin{refsection}

\setlength\cftparskip{0pt} 

\setlength\parindent{0pt}
\setlength\parskip{12pt}

\setcounter{footnote}{0}

\doublespacing

\setcounter{table}{0}
\setcounter{figure}{0}
\setcounter{page}{1}
\renewcommand{\thepage}{A\arabic{page}}
\renewcommand{\thetable}{\Alph{section}.\arabic{table}}
\renewcommand{\thefigure}{\Alph{section}.\arabic{figure}}

\singlespacing

\Large 
\noindent\textbf{Online Appendix for} \\
``Measuring How Much Judges Matter for Case Outcomes''

\normalsize

\vspace{1em}

Ryan Copus, UMKC School of Law\\
Ryan Hübert, London School of Economics and Political Science

\vspace{1em}

\noindent March 2025

\vspace{1em}

\noindent\textit{For Online Publication}

\vspace{1em}

\noindent\textit{Replication code and data will be made available.}

\vspace{1em}

\section{Formal Model of Judicial Decision Making}\label{app:model}

\paragraph{Cases.} Each case is defined by a finite collection of ``case features'' and an idiosyncratic stochastic element that is the case's ``fact pattern'' and captures how favorable the case is for the appellee. Formally, a case $i$ will have a vector of case features, $\mathbf{f} \in F_1 \times F_2 \times \cdots \times F_{f} \equiv \mathcal{F}$ and a random component ${x} \in \mathbb{R}$, where each $F_1, F_2, \ldots, F_{f}$ represents a different class of case features and ${x}$ is randomly distributed according to some known probability distribution with pdf $f(x|\mathbf{f})$.

The parameter ${x}$ captures aspects of the case that are specific to the case and separate from the case features. While these aspects may be ``deterministic'' in some metaphysical sense, from the perspective of the actors in the model (and the model's analysts), they seem as-if random. We thus model them as such. 

\paragraph{Panels.} 
The court has a set $\mathcal{P}$ of panels that are available to hear cases, where $|\mathcal{P}|=P$. There is a case assignment mechanism, which assigns a panel to the case. Formally, this is a stochastic mapping $\alpha : \mathcal{P} \to [0,1]$, giving the probability that each panel is assigned to the case, and where $\sum_{p \in \mathcal{P}} \alpha(p) = 1$. This mapping could be deterministic if, for example, $\alpha(p) \in \{0,1\}$ for all $p \in \mathcal{P}$. But in practice, courts in the U.S. assign most cases using a random process. The assignment mechanism generates a presiding panel, which hears the case and renders a decision $y \in \{0,1\}$. For example, the decision $y = 1$ could represent a judgment favoring the plaintiff (where $y = 0$ is a judgment favoring the defendant), or it could represent a decision to reverse the lower court's decision (where $y = 0$ is a decision to affirm). To make our exposition a bit more concrete, we will refer to $y = 1$ as a decision to reverse and $y = 0$ as a decision to affirm.

Panels may or may not disagree on how to resolve certain cases. Formally, we assume there is a mapping $\hat{x} : \mathcal{P} \times \mathcal{F} \to \mathbb{R}$, where $\hat{x}(p,\mathbf{f})$ specifies an ideal point for a panel $p \in \mathcal{P}$ for cases with case features $\mathbf{f} \in \mathcal{F}$. In other words, for each panel, each combination of case features induces not only a case space, but also an ideal point on that case space for each panel. This captures the important substantive idea that a panel might approach different kinds of cases differently. For example, a panel may be inclined to reverse most civil rights decisions, but not most torts decisions. 

To keep our notation a bit more tidy (and similar to other spatial models), we will write the feature-specific ideal points using subscripts to indicate the panel, e.g., $\hat{x}_p(\mathbf{f})$. Moreover, when we consider multiple panels, such as $p_1$ and $p_2$, we will simply use the panels' numerical subscripts, e.g. $\hat{x}_1(\mathbf{f})$ for $p_1$ and $\hat{x}_2(\mathbf{f})$ for $p_2$. 

Finally, we define a function $a : \mathcal{P} \times \mathcal{F} \to \mathbb{R}$, that returns the (feature-specific) ideal point of the panel actually assigned to a case.

We can now formally define the utility function for each panel $p$:
\begin{align*}
    u_p(y, {x}, \mathbf{f}) = \begin{dcases}
        1 &\text{if }(y = 1\text{ and }{x} \leq \hat{x}_p(\mathbf{f})) \text{ or }
        (y = 0\text{ and }{x} > \hat{x}_p(\mathbf{f}))\\
        0 &\text{otherwise}
    \end{dcases}
\end{align*}
Note that this is a standard utility function for a case space model.\footnote{This utility function embeds an assumption that the panel strictly prefers reversal whenever otherwise indifferent, i.e., when ${x} = \hat{x}_p(\mathbf{f})$.} It says that a panel strictly prefers that a case is reversed when ${x} \leq \hat{x}_p(\mathbf{f})$, and strictly prefers that a case is affirmed if ${x} > \hat{x}_p(\mathbf{f})$. 

\paragraph{Sequence.}
The model begins with a case $i$ being filed and proceeds as follows.

\begin{enumerate} 
    \item Case $i$'s case feature vector $\mathbf{f}_i$ is publicly drawn according to a known distribution
    \item A panel $p \in \mathcal{P}$ is assigned to the case according to the assignment mechanism
    \item The case's fact pattern ${x}_i$ is drawn and revealed to the panel
    \item The panel renders a judgment $y_i(p)\in \{0,1\}$
\end{enumerate}

Of course, since the panel chooses whether the case is reversed or not, its decision making is straightforward in this model. Working backward, the assigned panel reverses if ${x} \leq \hat{x}_p(\mathbf{f})$ and affirms if ${x} > \hat{x}_p(\mathbf{f})$. Now we can formally characterize the ``equilibrium'' of this model:\footnote{We put the word ``equilibrium'' in quotes since there is no strategic interaction in this model, and thus our solution is a simple utility maximizing problem for each panel.}

\begin{lemma}\label{lem:eqm} For a case $i$, each panel $p \in \mathcal{P}$ chooses $y_i^* = 1$ if ${x}_i \leq \hat{x}_p(\mathbf{f}_i)$, and $y_i^* = 0$ otherwise. More formally:
\begin{align*}
    y_i^*(p,x_i,\mathbf{f}_i) = 
    \begin{dcases}
        1 &\text{if }{x}_i \leq \hat{x}_p(\mathbf{f}_i)\\
        0 &\text{if }{x}_i > \hat{x}_p(\mathbf{f}_i)
    \end{dcases}
\end{align*}
\end{lemma}

\begin{proof}[Proof of Lemma 3]
    This follows directly from maximizing panels' utility functions. 
\end{proof}

Our goal in this paper is to quantify how much judges matter for case outcomes. We offer a clear theoretical definition of what it means for judges to ``matter.'' 

\begin{defn}
    We say that \textbf{judges matter for the outcome of case $i$} if and only if $|y_i^*(p_1,x_i,\mathbf{f}_i) - y_i^*(p_2,x_i,\mathbf{f}_i)| = 1$ for any $p_1, p_2 \in \mathcal{P}$ where $p_1 \neq p_2$. 
\end{defn}
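
To illustrate the definition with hypothetical numbers (not drawn from our data), consider two panels with feature-specific ideal points $\hat{x}_1(\mathbf{f}_i) = 0.2$ and $\hat{x}_2(\mathbf{f}_i) = 0.6$, and a case with fact pattern $x_i = 0.4$. Applying \Cref{lem:eqm},
\begin{align*}
    y_i^*(p_1,x_i,\mathbf{f}_i) &= 0 \quad\text{since } 0.4 > 0.2,\\
    y_i^*(p_2,x_i,\mathbf{f}_i) &= 1 \quad\text{since } 0.4 \leq 0.6,
\end{align*}
so $|y_i^*(p_1,x_i,\mathbf{f}_i) - y_i^*(p_2,x_i,\mathbf{f}_i)| = 1$ and judges matter for the outcome of this case. If instead $x_i = 0.1$, both panels reverse, and if $x_i = 0.9$, both panels affirm; in either case this pair of panels does not disagree.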

\section{Formal Proofs}\label{app:proofs}

\begin{proof}[Proof of Lemma 1] Suppose a population of cases $\mathcal{M}$ satisfies monotonicity. Then it follows from the definition of monotonicity (Definition 3) that there are two mutually exclusive and exhaustive cases to consider: (1) $\hat{x}_1 < \hat{x}_2$ for all $i \in \mathcal{M}$, and (2) $\hat{x}_1 > \hat{x}_2$ for all $i \in \mathcal{M}$. 

Consider case 1, which implies that $Y_i(p_1) \leq Y_i(p_2)$ for all $i \in \mathcal{M}$. Then, $|Y_i(p_1) - Y_i(p_2)| = Y_i(p_2) - Y_i(p_1) \geq 0$ for all $i \in \mathcal{M}$, and:
\begin{align*}
\delta(p_1,p_2) 
\equiv &~
\mathbb{E}\big[|Y_i(p_1) - Y_i(p_2)|\big]\\
& ~~= \mathbb{E}\big[Y_i(p_2) - Y_i(p_1)\big] \\
& ~~= |\mathbb{E}\big[Y_i(p_2) - Y_i(p_1)\big]| \\
& ~~= |\mathbb{E}\big[Y_i(p_1) - Y_i(p_2)\big]| \\
& ~~= \phi(p_1,p_2)
\end{align*}
For case 2, the same logic shows that $\delta(p_1,p_2) = |\mathbb{E}\big[Y_i(p_1) - Y_i(p_2)\big]|$, thus proving the lemma.
\end{proof}

\begin{proof}[Proof of Lemma 2] Suppose a population of cases satisfies random assignment. Then:
\begin{align*}
    D(p_1,p_2) 
    &= \big|\mathbb{E}[Y_i|p_1] - \mathbb{E}[Y_i|p_2]\big|\\
    & = \big|\mathbb{E}[Y_i(p_1)|p_1] - \mathbb{E}[Y_i(p_2)|p_2]\big|\\
    & = \big|\mathbb{E}[Y_i(p_1)|p_1] - \mathbb{E}[Y_i(p_2)|p_1]\big|\\
    & = \big|\mathbb{E}[Y_i(p_1) - Y_i(p_2)|p_1]\big|\\
    & = \big|\mathbb{E}[Y_i(p_1) - Y_i(p_2)]\big|\\
    & = \phi(p_1,p_2)
\end{align*}
Line 1 is the definition of $D$. Line 2 follows from the fact that $Y(p_1)$ is observed when panel $p_1$ is assigned and $Y(p_2)$ is observed when panel $p_2$ is assigned. Line 3 follows from random assignment. Line 4 follows from properties of expectations. Line 5 again follows from random assignment. Line 6 follows from the definition of $\phi$.
\end{proof}

\begin{proof}[Proof of Proposition 1]
    Follows from previous results and the fact that $\widehat{D}$ is an unbiased and consistent estimator for $D$ \parencite[see Theorem 16.3 in][]{Wasserman2004}.
\end{proof}

\begin{proof}[Proof of Proposition 2] Using the definition of $D(p_1,p_2)$ and performing some algebraic manipulations yields
\begin{align*}
    D(p_1,p_2) 
    &= \big|\mathbb{E}[Y_i|p_1] - \mathbb{E}[Y_i|p_2]\big|\\
    & = \big|\mathbb{E}[Y_i(p_1)|p_1] - \mathbb{E}[Y_i(p_2)|p_2]\big|\\
    & = \big|\mathbb{E}[Y_i(p_1)|p_1] - \mathbb{E}[Y_i(p_2)|p_1] + 
    \mathbb{E}[Y_i(p_2)|p_1]
    - \mathbb{E}[Y_i(p_2)|p_2] \big|
\end{align*}
The last line is equal to $\phi(p_1,p_2)$ if and only if $\mathbb{E}[Y_i(p_2)|p_1] - \mathbb{E}[Y_i(p_2)|p_2] = 0$. Otherwise, $D(p_1,p_2) \neq \phi(p_1,p_2)$.
\end{proof}


\begin{proof}[Proof of Proposition 3] 
    Suppose a population of cases $\mathcal{M}$ does not satisfy monotonicity. From the definition of monotonicity, this implies that there exist two subsets of cases, $\mathcal{M}_1$ and $\mathcal{M}_2$, where for all $i \in \mathcal{M}_1$, $\hat{x}_1 < \hat{x}_2$ and for all $i \in \mathcal{M}_2$, $\hat{x}_1 > \hat{x}_2$. Then:
    \begin{align*}
    \delta(p_1,p_2) 
    \equiv &~
    \mathbb{E}\big[|Y_i(p_1) - Y_i(p_2)|\big]\\
    & ~~= \mathbb{P}(\mathcal{M}_1)\mathbb{E}\big[|Y_i(p_1) - Y_i(p_2)| | \mathcal{M}_1\big] + \mathbb{P}(\mathcal{M}_2)\mathbb{E}\big[|Y_i(p_1) - Y_i(p_2)| | \mathcal{M}_2\big]\\
    & ~~= \mathbb{P}(\mathcal{M}_1)\mathbb{E}\big[Y_i(p_2) - Y_i(p_1) | \mathcal{M}_1\big] + \mathbb{P}(\mathcal{M}_2)\mathbb{E}\big[Y_i(p_1) - Y_i(p_2) | \mathcal{M}_2\big]\\
    & ~~= \mathbb{P}(\mathcal{M}_1)|\mathbb{E}\big[Y_i(p_2) - Y_i(p_1) | \mathcal{M}_1\big]|
    + \mathbb{P}(\mathcal{M}_2)|\mathbb{E}\big[Y_i(p_1) - Y_i(p_2) | \mathcal{M}_2\big]|\\
    & ~~= \mathbb{P}(\mathcal{M}_1)|\mathbb{E}\big[Y_i(p_1) - Y_i(p_2) | \mathcal{M}_1\big]|
    + \mathbb{P}(\mathcal{M}_2)|\mathbb{E}\big[Y_i(p_1) - Y_i(p_2) | \mathcal{M}_2\big]|\\
    & ~~> \Big|\mathbb{P}(\mathcal{M}_1)\underbrace{\mathbb{E}\big[Y_i(p_1) - Y_i(p_2)|\mathcal{M}_1\big]}_{=-|\mathbb{E}[Y_i(p_1) - Y_i(p_2) | \mathcal{M}_1]|<0}+
    \mathbb{P}(\mathcal{M}_2)\underbrace{\mathbb{E}\big[Y_i(p_1) - Y_i(p_2)|\mathcal{M}_2\big]}_{=|\mathbb{E}[Y_i(p_1) - Y_i(p_2) | \mathcal{M}_2]|>0}\Big|\\
    & ~~= |\mathbb{E}\big[Y_i(p_1) - Y_i(p_2)\big]| \\
    & ~~= \phi(p_1,p_2)
    \end{align*}
\end{proof}
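
A simple numerical illustration, with values chosen purely for exposition, shows how large the gap in Proposition 3 can be. Suppose $\mathbb{P}(\mathcal{M}_1) = \mathbb{P}(\mathcal{M}_2) = 0.5$, that $\mathbb{E}[Y_i(p_2) - Y_i(p_1)|\mathcal{M}_1] = 0.2$, and that $\mathbb{E}[Y_i(p_1) - Y_i(p_2)|\mathcal{M}_2] = 0.2$. Then
\begin{align*}
    \delta(p_1,p_2) &= 0.5(0.2) + 0.5(0.2) = 0.2,\\
    \phi(p_1,p_2) &= \big|0.5(-0.2) + 0.5(0.2)\big| = 0.
\end{align*}
The two panels disagree in 20\% of cases, yet $\phi(p_1,p_2)$ is zero because the disagreement in $\mathcal{M}_1$ exactly offsets the disagreement in $\mathcal{M}_2$.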

\begin{lemma}\label{lem:ordering}
$\mathbb{E}[Y_i|p_1, \mathbf{f}] < \mathbb{E}[Y_i|p_2, \mathbf{f}]$ if and only if $\hat{x}_1(\mathbf{f}) < \hat{x}_2(\mathbf{f})$.
\end{lemma}

\begin{proof}[Proof of Lemma 4] From Lemma 3, 
\begin{align*}
    \mathbb{E}[Y_i|p_1,\mathbf{f}] < \mathbb{E}[Y_i|p_2, \mathbf{f}]
    \Longleftrightarrow 
    \mathbb{P}(x_i \leq \hat{x}_1 | \mathbf{f}) < \mathbb{P}(x_i \leq \hat{x}_2 | \mathbf{f})
    \Longleftrightarrow
    \hat{x}_1(\mathbf{f}) < \hat{x}_2(\mathbf{f})
\end{align*}
And the last equivalence follows from the fact that $x_i$ is drawn from a fixed distribution induced by $\mathbf{f}$ regardless of whether $p_1$ or $p_2$ is assigned to the case.
\end{proof}
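
As a concrete illustration under an assumed distribution, suppose that ${x}_i$ is uniformly distributed on $[0,1]$ for cases with features $\mathbf{f}$, and that $\hat{x}_1(\mathbf{f}) = 0.2$ and $\hat{x}_2(\mathbf{f}) = 0.6$. Then
\begin{align*}
    \mathbb{E}[Y_i|p_1,\mathbf{f}] = \mathbb{P}(x_i \leq 0.2 | \mathbf{f}) = 0.2
    \quad\text{and}\quad
    \mathbb{E}[Y_i|p_2,\mathbf{f}] = \mathbb{P}(x_i \leq 0.6 | \mathbf{f}) = 0.6,
\end{align*}
so the panel with the higher ideal point has the higher expected reversal rate. Note that the strict inequality relies on $x_i$ having positive probability of falling between the two ideal points, which holds here because the uniform density is positive on $(0.2, 0.6]$.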

\begin{proof}[Proof of Proposition 4] Suppose that a population of cases satisfies random assignment. We show directly that $\delta(p_1,p_2) = M(p_1,p_2)$ as follows:
\begin{align*}
\delta(p_1,p_2) 
\equiv &~
\mathbb{E}\big[|Y_i(p_1) - Y_i(p_2)|\big]\\
& ~~= \Pr(\hat{x}_1 \leq \hat{x}_2)\mathbb{E}\big[|Y_i(p_1) - Y_i(p_2)| | \hat{x}_1 \leq \hat{x}_2\big]\\
& ~~~~~~~~+ \Pr(\hat{x}_1 > \hat{x}_2)\mathbb{E}\big[|Y_i(p_1) - Y_i(p_2)| | \hat{x}_1 > \hat{x}_2\big]\\
& ~~= \Pr(\hat{x}_1 \leq \hat{x}_2)\mathbb{E}\big[Y_i(p_2) - Y_i(p_1) | \hat{x}_1 \leq \hat{x}_2\big]\\
& ~~~~~~~~+ 
\Pr(\hat{x}_1 > \hat{x}_2)\mathbb{E}\big[Y_i(p_1) - Y_i(p_2) | \hat{x}_1 > \hat{x}_2\big]\\
& ~~= \Pr(\hat{x}_1 \leq \hat{x}_2)
\{\mathbb{E}\big[Y_i | p_2, \hat{x}_1 \leq \hat{x}_2\big] - \mathbb{E}\big[Y_i | p_1, \hat{x}_1 \leq \hat{x}_2\big]\} \\
& ~~~~~~~~+ 
\Pr(\hat{x}_1 > \hat{x}_2)
\{\mathbb{E}\big[Y_i | p_1, \hat{x}_1 > \hat{x}_2\big] - \mathbb{E}\big[Y_i | p_2, \hat{x}_1 > \hat{x}_2\big]\}\\
& ~~= \Pr(\hat{x}_1 \leq \hat{x}_2)
\{\mathbb{E}\big[Y_i | m_i(p_1,p_2) = 1\big] - \mathbb{E}\big[Y_i | m_i(p_1,p_2) = 0\big]\} \\
& ~~~~~~~~+ 
\Pr(\hat{x}_1 > \hat{x}_2)
\{\mathbb{E}\big[Y_i | m_i(p_1,p_2) = 1\big] - \mathbb{E}\big[Y_i | m_i(p_1,p_2) = 0\big]\}\\
& ~~= \Pr(\hat{x}_1 \leq \hat{x}_2)\mathbb{E}\big[Y_i | m_i(p_1,p_2) = 1\big] 
- \Pr(\hat{x}_1 \leq \hat{x}_2)\mathbb{E}\big[Y_i | m_i(p_1,p_2) = 0\big] \\
& ~~~~~~~~+ 
\Pr(\hat{x}_1 > \hat{x}_2)\mathbb{E}\big[Y_i | m_i(p_1,p_2) = 1\big] 
- \Pr(\hat{x}_1 > \hat{x}_2)\mathbb{E}\big[Y_i | m_i(p_1,p_2) = 0\big]\\
& ~~= \mathbb{E}\big[Y_i | m_i(p_1,p_2) = 1\big] - \mathbb{E}\big[Y_i | m_i(p_1,p_2) = 0\big]\\
& ~~\equiv M(p_1,p_2)
\end{align*}
The first equality is a decomposition of the first expectation. The second equality follows from the fact that $Y_i(p_2) \geq Y_i(p_1)$ whenever $\hat{x}_1 \leq \hat{x}_2$ (and $Y_i(p_1) \geq Y_i(p_2)$ whenever $\hat{x}_1 > \hat{x}_2$), allowing us to drop the absolute values. The third equality follows from properties of expectations and the fact that random assignment is satisfied (see Lemma 2). The fourth equality follows from the definition of $m_i(p_1,p_2)$ (Definition 5). The final two equalities are algebra.
\end{proof}

\begin{proof}[Proof of Proposition 5] Let $\mu(\widehat{m}_i, m_i)$ be the probability that a case $i$ with true treatment assignment $m_i$ is classified as $\widehat{m}_i$. Then, we can decompose $\widehat{M}(p_1,p_2)$ as follows,
\begin{align*}
\widehat{M}(p_1,p_2) &= 
\mathbb{E}\big[Y_i | \widehat{m}_i = 1\big] - \mathbb{E}\big[Y_i | \widehat{m}_i = 0\big] \\
& = \mu(1,1)\mathbb{E}\big[Y_i | \widehat{m}_i = 1, m_i = 1\big] + \mu(1,0)\mathbb{E}\big[Y_i | \widehat{m}_i = 1, m_i = 0\big]\\ 
&~~~~~~~ - \{\mu(0,0)\mathbb{E}\big[Y_i | \widehat{m}_i = 0, m_i = 0\big] + \mu(0,1)\mathbb{E}\big[Y_i | \widehat{m}_i = 0, m_i = 1\big]\}\\
& = \mu(1,1)\mathbb{E}\big[Y_i | m_i = 1\big] + \mu(1,0)\mathbb{E}\big[Y_i | m_i = 0\big]\\ 
&~~~~~~~ - \{\mu(0,0)\mathbb{E}\big[Y_i | m_i = 0\big] + \mu(0,1)\mathbb{E}\big[Y_i | m_i = 1\big]\}
\end{align*}
By the construction of $m_i$, it follows that $\mathbb{E}\big[Y_i | m_i = 1\big] > \mathbb{E}\big[Y_i | m_i = 0\big]$. Moreover, since $\mu(1,1) + \mu(0,1) = 1$ and $\mu(1,0) + \mu(0,0) = 1$, then if there is misclassification,
\begin{align*}
\mu(1,1)\mathbb{E}\big[Y_i | m_i = 1\big] + \mu(1,0)\mathbb{E}\big[Y_i | m_i = 0\big] <  \mathbb{E}\big[Y_i | m_i = 1\big] \text{ for all }\mu(1,0)>0
\end{align*}
and
\begin{align*}
\mu(0,0)\mathbb{E}\big[Y_i | m_i = 0\big] + \mu(0,1)\mathbb{E}\big[Y_i | m_i = 1\big] > \mathbb{E}\big[Y_i | m_i = 0\big] 
 \text{ for all }\mu(0,1)>0
\end{align*}
Then, $\widehat{M}(p_1,p_2) < \mathbb{E}\big[Y_i | m_i = 1\big] - \mathbb{E}\big[Y_i | m_i = 0\big]$.
\end{proof}
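
To see the attenuation at work, consider a hypothetical example that assumes the two treatment groups are of equal size and that misclassification is symmetric, with $\mu(1,1) = \mu(0,0) = 0.9$ and $\mu(0,1) = \mu(1,0) = 0.1$. Suppose further that $\mathbb{E}[Y_i | m_i = 1] = 0.5$ and $\mathbb{E}[Y_i | m_i = 0] = 0.3$, so that the true difference is $0.2$. Substituting into the decomposition in the proof of Proposition 5,
\begin{align*}
    \widehat{M}(p_1,p_2) &= 0.9(0.5) + 0.1(0.3) - \{0.9(0.3) + 0.1(0.5)\}\\
    &= 0.48 - 0.32 = 0.16 < 0.2.
\end{align*}
Misclassifying 10\% of cases in each group therefore shrinks the estimate by a fifth in this example. More generally, with a symmetric misclassification rate $e < 0.5$, the same calculation gives $(1 - 2e) \times 0.2$.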

\section{Estimation Details for PRQs}\label{app:model-details}

As we describe in the main text, our process involved using machine learning methods for two separate prediction exercises. 

For the first, we predicted each case's outcome (i.e., whether it was reversed or not) using \textit{only} case characteristics. With these predictions, we generate a new outcome variable for each case, which we refer to as that case's ``deviance.'' It is calculated by subtracting the predicted probability of reversal from the actual outcome. We interpret this new outcome as the extent to which a case's outcome deviates from what would be expected for the court as a whole.
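
In symbols, and using $d_i$ as illustrative notation for the deviance of case $i$, this calculation is
\begin{align*}
    d_i = Y_i - \Ehatf,
\end{align*}
where we write $\Ehatf$ for the out-of-sample predicted probability of reversal from the model that uses only case characteristics. For example, with a hypothetical predicted probability of $0.25$, a case that was reversed ($Y_i = 1$) has a deviance of $0.75$, while a case that was affirmed ($Y_i = 0$) has a deviance of $-0.25$.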

For the second, we predicted each case's deviance using panel and judge characteristics, as well as a small set of carefully chosen case characteristics.\footnote{These case characteristics were already used to train the model that generated the deviance scores, so it may be surprising that we include them again. However, since the first model did not include any panel characteristics, it was unable to learn from potentially salient \textit{interactions} between case characteristics and panel characteristics. Our ``carefully chosen'' case characteristics are those whose predictive power may be conditional on panel characteristics, and thus useful for further improving our predictions in the second prediction exercise.} This prediction exercise yields a model that allows us to predict each case's deviance for every panel of judges that \textit{could have been} assigned to the case (including the one that was actually assigned). 

For both prediction exercises, we used a stacking ensemble with several base learners (described below), and we generated all predictions out of sample.

\subsection{Lists of Predictors}\label{app:predictors}

Most of the variables in our analysis come from public databases available on the Federal Judicial Center website. In particular:

\begin{itemize}
    \item The Integrated Database (available at \url{https://www.fjc.gov/research/idb}), which contains case level data from the Ninth Circuit as well as from all the District Courts in the Ninth Circuit. 
    \item The Biographical Directory of Article III Federal Judges (available at \url{https://www.fjc.gov/history/judges}), which contains biographical and appointment data for all Article III judges appointed since 1789. 
\end{itemize}

We compiled original data on which judges sat in the three-judge panels overseeing each case using docket sheets available from PACER. We also compiled DIME scores from \textcite{bonica_2017}.

\input{../Outputs/tableC1.tex}

\input{../Outputs/tableC2.tex}

\subsection{Details On Our Machine Learning Models}\label{app:ml-details}

The panel-free model predictions (used to residualize the outcome) are generated from a stacking ensemble using the H2O package for R \parencite{h2o_R_package}. The panel-free model consists of six base learners. We use default parameters unless otherwise noted. We selected parameters for each base learner via testing of performance on data that was not used in the analysis (i.e., data that was dropped because we were not confident that the cases were randomly assigned to panels). The base learners are (1) a gradient boosting machine with 1,000 trees, a max depth of three, minimum rows of two, a learning rate of 0.1, a column sample rate of 1, and a sample rate of 1; (2) a gradient boosting machine with 1,000 trees, a max depth of two, minimum rows of two, a learning rate of 0.1, a column sample rate of 0.8, and a sample rate of 0.8; (3) a LASSO regression; (4) a ridge regression; (5) an XGBoost with 1,000 trees, a max depth of four, and a learning rate of 0.01; and (6) a random forest with 300 trees and minimum rows of two.


The model to predict residualized outcomes is a stacking ensemble with five base learners. We selected parameters for each base learner via testing of performance on data that was not used in the analysis. We use default parameters unless otherwise noted. The base learners are (1) a LASSO regression; (2) a ridge regression; (3) a gradient boosting machine with 500 trees, a max depth of two, minimum rows of one, a learning rate of 0.1, a column sample rate of 0.1, and a sample rate of 0.6; (4) a gradient boosting machine with 1,000 trees, a max depth of three, minimum rows of one, a learning rate of 0.1, a column sample rate of 0.1, a sample rate of 0.7, and a column sample rate per tree of 0.2; and (5) a random forest with 500 trees, minimum rows of 100, and an mtry of 10.

The panel-free model for the Ninth Circuit has an AUC of .59. Adding the predicted panel deviances increases the AUC to .62.

\printbibliography 

\newpage % Required to preserve page number formatting

\end{refsection}

\end{document}
